Abstract

We prove the following strong hardness result for learning: Given a distribution of labeled 
examples from the hypercube such that there exists a monomial consistent with (1 − ε) of
the examples, it is NP-hard to find a halfspace that is correct on (1/2 + ε) of the examples,
for arbitrary constants ε > 0. In learning theory terms, weak agnostic learning of monomials
is hard, even if one is allowed to output a hypothesis from the much bigger concept class of 
halfspaces. This hardness result subsumes a long line of previous results, including two recent 
hardness results for the proper learning of monomials and halfspaces. As an immediate corollary 
of our result we show that weak agnostic learning of decision lists is NP-hard. 

Our techniques are quite different from previous hardness proofs for learning. We define 
distributions on positive and negative examples for monomials whose first few moments match. 
We use the invariance principle to argue that regular halfspaces (all of whose coefficients have
small absolute value relative to the total ℓ2 norm) cannot distinguish between distributions
whose first few moments match. For highly non-regular halfspaces, we use a structural lemma
from recent work on fooling halfspaces to argue that they are "junta-like" and one can zero 
out all but the top few coefficients without affecting the performance of the halfspace. The 
top few coefficients form the natural list decoding of a halfspace in the context of dictatorship 
tests/Label Cover reductions. 

We note that unlike previous invariance principle based proofs which are only known to give 
Unique-Games hardness, we are able to reduce from a version of the Label Cover problem that
is known to be NP-hard. This has inspired follow-up work on bypassing the Unique Games 
conjecture in some optimal geometric inapproximability results. 
1 Introduction



Boolean conjunctions (or monomials), decision lists, and halfspaces are among the most basic 
concept classes in learning theory. They are all long-known to be efficiently PAC learnable, when 
the given examples are guaranteed to be consistent with a function from any of these concept 
classes [44, 7, 41]. However, in practice data is often noisy or too complex to be consistently 
explained by a simple concept. A common practical approach to such problems is to find a predictor 
in a certain space of hypotheses that best fits the given examples. A general model for learning that 
addresses this scenario is the agnostic learning model [22, 27]. An agnostic learning algorithm for 
a class of functions C using a hypothesis space H is required to perform the following task: Given
examples drawn from some unknown distribution, the algorithm must find a hypothesis in H that
classifies the examples nearly as well as is possible by a hypothesis from C. The algorithm is said
to be a proper learning algorithm if C = H.

In this work we address the complexity of agnostic learning of monomials by algorithms that 
output a halfspace as a hypothesis. Learning methods that output a halfspace as a hypothesis such 
as Perceptron [42], Winnow [36], Support Vector Machines [45] as well as most boosting algorithms 
are well-studied in theory and widely used in practical prediction systems. These classifiers are 
often applied to labeled data sets which are not linearly separable. Hence it is of great interest to 
determine the classes of problems that can be solved by such methods in the agnostic setting. In 
this work we demonstrate a strong negative result on agnostic learning by halfspaces. We prove 
that non-trivial agnostic learning of even the relatively simple class of monomials by halfspaces is 
an NP-hard problem. 

Theorem 1.1. For any constant ε > 0, it is NP-hard to find a halfspace that correctly labels
(1/2 + ε)-fraction of given examples over $\{0,1\}^n$ even when there exists a monomial that agrees
with a (1 − ε)-fraction of the examples.

Note that this hardness result is essentially optimal since it is trivial to find a hypothesis with
agreement rate 1/2: output either the function that is always 0 or the function that is always
1. Also note that Theorem 1.1 measures agreement of a halfspace and a monomial with the given
set of examples rather than the probability of agreement of h with an example drawn randomly 
from an unknown distribution. Uniform convergence results based on the VC dimension imply that 
these settings are essentially equivalent (see for example [22, 27]). 

The class of monomials is a subset of the class of decision lists which in turn is a subset of the 
class of halfspaces. Therefore our result immediately implies an optimal hardness result for proper 
agnostic learning of decision lists. 

Previous work 

Before describing the details of the prior body of work on hardness results for learning, we note that 
our result subsumes all these results with just one exception (the hardness of learning monomials 
by t-CNFs [34]). This is because we obtain the optimal inapproximability factor and allow learning 
of monomials by the much richer class of halfspaces. 

The results of the paper are noteworthy in the broader context of hardness of approximation. 
Previously, hardness proofs based on the invariance principle were only known to give Unique-Games 
hardness. In this work, we are able to harness invariance principles to show an NP-hardness result by
working with a version of Label Cover whose projection functions are only required to be
unique-on-average. This could be one potential approach to revisit the many strong inapproximability
results conditioned on the Unique Games conjecture (UGC), with an eye towards bypassing the
UGC assumption. Such a goal was achieved for some geometric problems recently [21]; see Section
2.3.

Agnostic learning of monomials, decision lists and halfspaces has been studied in a number of 
previous works. Proper agnostic learning of a class of functions C is equivalent to the ability to 
come up with a function in C which has the optimal agreement rate with the given set of examples 
and is also referred to as the Maximum Agreement problem for a class of functions C.

The Maximum Agreement problem for halfspaces is equivalent to the so-called Hemisphere 
problem and is long known to be NP-complete [24, 17]. Amaldi and Kann [1] showed that Maximum
Agreement for halfspaces is NP-hard to approximate within some constant factor. This factor was
later improved by Ben-David et al. [5], and then by Bshouty and Burroughs [9].
An optimal inapproximability result was established independently by Guruswami and
Raghavendra [20] and Feldman et al. [15] showing NP-hardness of approximating the Maximum
Agreement problem for halfspaces within (1/2 + ε) for every constant ε > 0. The reduction in [15]
requires examples with real-valued coordinates, whereas the proof in [20] also works for examples 
drawn from the Boolean hypercube. 

The Maximum Agreement problem for monotone monomials was shown to be NP-hard by 
Angluin and Laird [2], and NP-hardness for general monomials was shown by Kearns and Li [28]. 
The hardness of approximating the maximum agreement within a constant factor was shown by Ben-David et
al. [5]. The factor was subsequently improved to 58/59 by Bshouty and Burroughs [9]. Finally,
Feldman et al. [14, 15] showed a tight inapproximability result, namely that it is NP-hard to
distinguish between the instances where (1 − ε)-fraction of the labeled examples are consistent with
some monomial and instances where every monomial is consistent with at most (1/2 + ε)-fraction of
the examples. Recently, Khot and Saket [34] proved a similar hardness result even when a t-CNF 
is allowed as output hypothesis for an arbitrary constant t (a t-CNF is the conjunction of several 
clauses, each of which has at most t literals; a monomial is thus a 1-CNF). 

For the concept class of decision lists, APX-hardness (or hardness to approximate within some
constant factor) of the Maximum Agreement problem was shown by Bshouty and Burroughs [9]. 
As mentioned above, our result subsumes all these results with the exception of [34]. 

A number of hardness of approximation results are also known for the complementary problem 
of minimizing disagreement for each of the above concept classes [27, 23, 3, 8, 14, 15]. Another 
well-known evidence of the hardness of agnostic learning of monomials is that even a non-proper 
agnostic learning of monomials would give an algorithm for learning DNF — a major open problem 
in learning theory [35]. Further, Kalai et al. proved that even agnostic learning of halfspaces with 
respect to the uniform distribution implies learning of parities with random classification noise — 
a long-standing open problem in learning theory and coding [25]. 

Monomials, decision lists and halfspaces are known to be efficiently learnable in the presence 
of more benign random classification noise [2, 26, 29, 10, 6, 12]. Simple online algorithms like 
Perceptron and Winnow learn halfspaces when the examples can be separated with a significant 
margin (as is the case if the examples are consistent with a monomial) and are known to be robust 
to a very mild amount of adversarial noise [16, 4, 18]. Our result implies that these positive results
will not hold when the adversarial noise rate is ε for any constant ε > 0.

Kalai et al. gave the first non-trivial algorithm for agnostic learning monomials in time $2^{\tilde{O}(\sqrt{n})}$
[25]. They also gave a breakthrough result for agnostic learning of halfspaces with respect to the 
uniform distribution on the hypercube up to any constant accuracy (and analogous results for a 
number of other settings). Their algorithms output linear thresholds of parities as hypotheses. In 
contrast, our hardness result is for algorithms that output a halfspace (which is a linear threshold 
of single variables). 

Organization of the paper: We sketch the idea of our proof in Section 2. We define some 
probability and analytical tools in Section 3. In Section 4 we define the dictatorship test, which is 
an important gadget for the hardness reduction. For the purpose of illustration, we also show 
why this dictatorship test already suffices to prove Theorem 1.1 assuming the Unique Games 
Conjecture [30]. In Section 5, we describe a reduction from a variant of the Label Cover problem 
to prove Theorem 1.1 under the assumption that P ≠ NP.

Notation: We use 0 to encode "False" and 1 to encode "True". We denote pos(t) : ℝ → {0, 1}
as the indicator function of whether t ≥ 0; i.e., pos(t) = 1 when t ≥ 0 and pos(t) = 0 when t < 0.

For $x = (x_1, x_2, \ldots, x_n) \in \{0,1\}^n$, $w \in \mathbb{R}^n$, and $\theta \in \mathbb{R}$, a halfspace $h(x)$ is a Boolean function of
the form $\mathrm{pos}(w \cdot x - \theta)$; a monomial (conjunction) is a function of the form $\bigwedge_{i \in S} s_i$, where $S \subseteq [n]$
and $s_i$ is the literal of $x_i$ which can represent either $x_i$ or $\neg x_i$; a disjunction is a function of the form
$\bigvee_{i \in S} s_i$. One special case of monomials is the function $f(x) = x_i$ for some $i \in [n]$, also referred to
as the i-th dictator function.
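The notation above can be made concrete with a minimal Python sketch. It also illustrates why monomials form a subclass of halfspaces: a conjunction is computed by a halfspace whose weights are +1 on positive literals and −1 on negated ones. The function names (`halfspace`, `monomial`, `monomial_as_halfspace`, `dictator`) and the 0-based indexing are illustrative choices, not notation from the paper.

```python
from itertools import product

def pos(t):
    """Indicator of t >= 0, following the Notation paragraph."""
    return 1 if t >= 0 else 0

def halfspace(w, theta):
    """Return h with h(x) = pos(<w, x> - theta)."""
    return lambda x: pos(sum(wi * xi for wi, xi in zip(w, x)) - theta)

def monomial(literals):
    """Conjunction of literals; `literals` maps a 0-based index to the bit it requires
    (1 for the literal x_i, 0 for its negation)."""
    return lambda x: int(all(x[i] == b for i, b in literals.items()))

def monomial_as_halfspace(literals, n):
    """Halfspace computing the same conjunction: weight +1 on positive literals,
    -1 on negated literals, threshold = number of positive literals."""
    w = [0] * n
    for i, b in literals.items():
        w[i] = 1 if b == 1 else -1
    theta = sum(1 for b in literals.values() if b == 1)
    return halfspace(w, theta)

def dictator(i, n):
    """The i-th dictator function f(x) = x_i, as a special case of a monomial."""
    return monomial_as_halfspace({i: 1}, n)

if __name__ == "__main__":
    lits = {0: 1, 2: 0}  # the monomial x_0 AND (NOT x_2)
    agree = all(monomial(lits)(x) == monomial_as_halfspace(lits, 4)(x)
                for x in product((0, 1), repeat=4))
    print("monomial and halfspace agree on all inputs:", agree)
```

The weighted sum reaches the threshold only when every positive literal is 1 and every negated literal is 0, which is exactly the condition for the conjunction to hold.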



2 Proof Overview 

We prove Theorem 1.1 by exhibiting a reduction from the k-Label Cover problem, which is
a particular variant of the Label Cover problem. The k-Label Cover problem is defined as
follows:

Definition 2.1. For positive integers M, N with M ≥ N and k ≥ 2, an instance of k-Label
Cover $\mathcal{L}(G(V,E), M, N, \{\pi^{v,e} \mid e \in E, v \in e\})$ consists of a k-uniform connected (multi-)hypergraph
$G(V,E)$ with vertex set V and an edge multiset E, together with a set of projection functions. Every hyperedge
$e = (v_1, \ldots, v_k)$ is associated with a k-tuple of projection functions $\{\pi^{v_i,e}\}_{i=1}^{k}$ where $\pi^{v_i,e} : [M] \to [N]$.

A vertex labeling Λ is an assignment of labels to vertices $\Lambda : V \to [M]$. A labeling Λ is said to
strongly satisfy an edge e if $\pi^{v_i,e}(\Lambda(v_i)) = \pi^{v_j,e}(\Lambda(v_j))$ for every $v_i, v_j \in e$. A labeling Λ weakly
satisfies edge e if $\pi^{v_i,e}(\Lambda(v_i)) = \pi^{v_j,e}(\Lambda(v_j))$ for some $v_i, v_j \in e$, $v_i \neq v_j$.

The goal in Label Cover is to find a vertex labeling that satisfies as many edges (projection 
constraints) as possible. 
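The two notions of satisfaction can be checked mechanically. The following is a minimal sketch, assuming a hypothetical data layout in which an edge is a tuple of vertex names, labels are 0-based integers, and `proj[(v, edge)]` is a list giving the projection of each label of v on that edge. The toy instance (k = 3, M = 4, N = 3) is invented for illustration.

```python
from itertools import combinations

def projected(labeling, proj, edge, v):
    """Image of v's label under the projection attached to the pair (v, edge)."""
    return proj[(v, edge)][labeling[v]]

def strongly_satisfies(labeling, proj, edge):
    """All vertices of the edge project to one common label."""
    return len({projected(labeling, proj, edge, v) for v in edge}) == 1

def weakly_satisfies(labeling, proj, edge):
    """At least two distinct vertices of the edge project to a common label."""
    return any(
        projected(labeling, proj, edge, u) == projected(labeling, proj, edge, v)
        for u, v in combinations(edge, 2)
        if u != v
    )

# Toy instance: one hyperedge on three vertices, labels in {0,1,2,3}, projections into {0,1,2}.
EDGE = ("a", "b", "c")
PROJ = {
    ("a", EDGE): [0, 0, 1, 2],
    ("b", EDGE): [0, 1, 2, 2],
    ("c", EDGE): [1, 1, 0, 2],
}

if __name__ == "__main__":
    for labeling in ({"a": 0, "b": 0, "c": 2},
                     {"a": 0, "b": 0, "c": 0},
                     {"a": 0, "b": 1, "c": 3}):
        print(labeling,
              "strong:", strongly_satisfies(labeling, PROJ, EDGE),
              "weak:", weakly_satisfies(labeling, PROJ, EDGE))
```

Strong satisfaction implies weak satisfaction for k ≥ 2, which is why the conjecture below can contrast labelings that strongly satisfy almost all edges with instances where not even weak satisfaction is achievable on many edges.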
2.1 Hardness assuming the Unique Games conjecture

For the sake of clarity, we first sketch the proof of Theorem 1.1 with a reduction from the
k-Unique Label Cover problem which is a special case of k-Label Cover where M = N and all
the projection functions $\{\pi^{v,e} \mid v \in e, e \in E\}$ are bijections. The following inapproximability result
[33] for k-Unique Label Cover is equivalent to the Unique Games Conjecture of Khot [30].

Conjecture 2.2. For every constant η > 0 and a positive integer k, there exists an integer $R_0$
such that for all positive integers $R > R_0$, given an instance $\mathcal{L}(G(V,E), R, R, \{\pi^{v,e} \mid e \in E, v \in e\})$
it is NP-hard to distinguish between,

• strongly satisfiable instances: there exists a labeling $\Lambda : V \to [R]$ that strongly satisfies 1 − kη
fraction of the edges E.

• almost unsatisfiable instances: there is no labeling that weakly satisfies -^ji fraction of the 
edges. 

Given an instance $\mathcal{L}$ of k-Unique Label Cover, we will produce a distribution $\mathcal{D}$ over labeled
examples such that the following holds: if $\mathcal{L}$ is a strongly satisfiable instance, then there is a
disjunction that agrees with the label on a randomly chosen example with probability at least 1 − ε,
while if $\mathcal{L}$ is an almost unsatisfiable instance then no halfspace agrees with the label on a random
example from $\mathcal{D}$ with probability more than 1/2 + ε. Clearly, such a reduction implies Theorem 1.1
assuming the Unique Games Conjecture but with disjunctions in place of conjunctions. De Morgan's 
law and the fact that a negation of a halfspace is a halfspace then imply that the statement is also 
true for monomials (we use disjunctions only for convenience). 

Let $\mathcal{L}$ be an instance of k-Unique Label Cover on hypergraph G = (V, E) and a set of labels
[R]. The examples we generate will have |V| × R coordinates, i.e., belong to $\{0,1\}^{|V| \times R}$. These
coordinates are to be thought of as one block of R coordinates for every vertex v ∈ V. We will
index the coordinates of $x \in \{0,1\}^{|V| \times R}$ as $x = (x_v^{(r)})_{v \in V, r \in [R]}$.

For every labeling $\Lambda : V \to [R]$ of the instance, there is a corresponding disjunction over
$\{0,1\}^{|V| \times R}$ given by

$$h(x) = \bigvee_{v \in V} x_v^{(\Lambda(v))}.$$

Thus, using a label r for a vertex v is encoded as including the literal $x_v^{(r)}$ in the disjunction. Notice
that an arbitrary halfspace over $\{0,1\}^{|V| \times R}$ need not correspond to any labeling at all. The idea
would be to construct a distribution on examples which ensures that any halfspace agreeing with at
least 1/2 + ε fraction of random examples somehow corresponds to a labeling Λ weakly satisfying
a constant fraction of the edges in $\mathcal{L}$.
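As a small illustration of this encoding, the sketch below maps a labeling to the weight vector of the halfspace that computes the corresponding disjunction. The flat coordinate layout (block of R coordinates per vertex, index `block * R + r`), the 0-based labels, and the helper names are assumptions made for the example.

```python
def labeling_to_disjunction(labeling, vertices, R):
    """Return (w, theta) such that pos(<w, x> - theta) equals the OR over v of x_v^{(labeling[v])}.

    Coordinates are laid out as one block of R bits per vertex, in the order of `vertices`.
    """
    block = {v: i for i, v in enumerate(vertices)}
    w = [0] * (len(vertices) * R)
    for v, r in labeling.items():
        w[block[v] * R + r] = 1
    return w, 1  # the sum of selected bits is >= 1 exactly when some selected bit is 1

def evaluate(w, theta, x):
    """Evaluate the halfspace pos(<w, x> - theta) on a 0/1 vector x."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) - theta >= 0 else 0

if __name__ == "__main__":
    vertices = ["u", "v"]
    w, theta = labeling_to_disjunction({"u": 2, "v": 0}, vertices, R=3)
    print(w)                                         # weight 1 on (u, 2) and (v, 0)
    print(evaluate(w, theta, [0, 0, 1, 0, 0, 0]))    # literal x_u^{(2)} is set
    print(evaluate(w, theta, [0, 1, 0, 0, 0, 0]))    # only x_u^{(1)} is set
```

Every labeling yields such a 0/1 weight vector with exactly one nonzero weight per block, whereas a general halfspace has arbitrary real weights. That gap is what the decoding step of the reduction has to bridge.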

Fix an edge $e = (v_1, \ldots, v_k)$. For the sake of exposition, let us assume $\pi^{v_i,e}$ is the identity
permutation for every i ∈ [k]. The general case is not any more complicated.

For the edge e, we will construct a distribution on examples $\mathcal{D}_e$ with the following properties:

• All coordinates $x_v^{(r)}$ for a vertex v ∉ e are fixed to be zero. Restricted to these examples, the
halfspace h can be written as $h(x) = \mathrm{pos}(\sum_{i \in [k]} \langle w_{v_i}, x_{v_i} \rangle - \theta)$.

• For any label r ∈ [R], the labeling $\Lambda(v_1) = \ldots = \Lambda(v_k) = r$ strongly satisfies the edge e.
Hence, the corresponding disjunction $\bigvee_{i \in [k]} x_{v_i}^{(r)}$ needs to have agreement ≥ 1 − ε with the
examples from $\mathcal{D}_e$.

• There exists a decoding procedure that given a halfspace h outputs a labeling $\Lambda_h$ for $\mathcal{L}$ such
that, if h has agreement ≥ 1/2 + ε with the examples from $\mathcal{D}_e$, then $\Lambda_h$ weakly satisfies the edge
e with non-negligible probability.

For conceptual clarity, let us rephrase the above requirement as a testing problem. Given
a halfspace h, consider a randomized procedure that samples an example (x, b) from the
distribution $\mathcal{D}_e$, and accepts if h(x) = b. This amounts to a test that checks if the function h
corresponds to a consistent labeling. Further, let us suppose the halfspace h is given by
$h(x) = \mathrm{pos}(\sum_{v \in V} \langle w_v, x_v \rangle - \theta)$. Define the linear function $f_v : \{0,1\}^R \to \mathbb{R}$ as $f_v(x_v) = \langle w_v, x_v \rangle$. Then,
we have $h(x) = \mathrm{pos}(\sum_{v \in V} f_v(x_v) - \theta)$.

For a halfspace h corresponding to a labeling Λ, we will have $f_v(x_v) = x_v^{(\Lambda(v))}$, which is a dictator
function. Thus, in the intended solution every linear function $f_v$ associated with the halfspace h is
a dictator function.

Now, let us again restate the above testing problem in terms of these linear functions. For
succinctness, we write $f_i$ for the linear function $f_{v_i}$. We need a randomized procedure that does
the following:

Given k linear functions $f_1, \ldots, f_k : \{0,1\}^R \to \mathbb{R}$, queries the functions at one point
each (say $x_1, \ldots, x_k$ respectively), and accepts if $\mathrm{pos}(\sum_{i=1}^{k} f_i(x_i) - \theta) = b$.

The procedure must satisfy, 

• (Completeness) If each of the linear functions $f_i$ is the r'th dictator function for some r ∈ [R],
then the test accepts with probability 1 − ε.

• (Soundness) If the test accepts with probability 1/2 + ε, then at least two of the linear functions
are close to the same dictator function.

A testing problem of the above nature is referred to as Dictatorship Testing and is a recurring
theme in hardness of approximation. 

Notice that the notion of a linear function being close to a dictator function is not formally 
defined yet. In most applications, a function is said to be close to a dictator if it has influential 
coordinates. It is easy to see that this notion is not sufficient by itself here. For example, in the
halfspace $\mathrm{pos}(10^{100} x_1 + x_2 - 0.5)$, although the coordinate $x_2$ has little influence on the linear
function, it has significant influence on the halfspace.
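A quick numeric check of this example is sketched below. It assumes the large coefficient is 10^100 and uses two illustrative measures that are not defined in the paper: the share of squared weight carried by a coordinate (a proxy for influence on the linear function) and the fraction of inputs on which flipping that coordinate changes the halfspace output under the uniform distribution.

```python
from fractions import Fraction
from itertools import product

W = (10**100, 1)          # coefficients of x_1 and x_2 (stored at indices 0 and 1)
THETA = Fraction(1, 2)

def h(x):
    """The halfspace pos(10^100 * x_1 + x_2 - 0.5), computed exactly."""
    return 1 if W[0] * x[0] + W[1] * x[1] - THETA >= 0 else 0

def weight_share(i):
    """Fraction of the squared l2 mass of W carried by coordinate i."""
    return Fraction(W[i] ** 2, sum(v * v for v in W))

def influence(i):
    """Fraction of inputs in {0,1}^2 on which flipping bit i changes h."""
    flips = 0
    for x in product((0, 1), repeat=2):
        y = list(x)
        y[i] ^= 1
        flips += h(x) != h(tuple(y))
    return Fraction(flips, 4)

if __name__ == "__main__":
    print("weight share of x_2 below 10^-199:", weight_share(1) < Fraction(1, 10**199))
    print("influence of x_2 on the halfspace:", influence(1))
```

The second coordinate carries a vanishing share of the weight, yet it decides the output whenever the first coordinate is 0. Any notion of closeness to a dictator based only on the relative size of coefficients would therefore miss it.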

We resolve this problem by using the notion of critical index (Definition 3.1) that was introduced
in [43] and has found numerous applications in the analysis of halfspaces [37, 40, 13]. Roughly
speaking, given a linear function f, the idea is to recursively delete its influential coordinates until
there are none left. The total number of coordinates so deleted is referred to as the critical index
of f. Let $c_\tau(w_i)$ denote the critical index of $w_i$, and let $C_\tau(w_i)$ denote the set of $c_\tau(w_i)$ largest
coordinates of $w_i$. The linear function $f_i$ is said to be close to the j'th dictator function for every j
in $C_\tau(w_i)$. A function is far from every dictator if it has critical index 0, that is, it has no influential coordinate
to delete.

An important issue is that the critical index of a linear function can be much larger than the 
number of influential coordinates and cannot be appropriately bounded. In other words, a linear 
function can be close to a large number of dictator functions, as per the definition above. To counter 
this, we employ a structural lemma about halfspaces that was used in the recent work on fooling 
halfspaces with limited independence [13]. Using this lemma, we are able to prove that if the critical
index is large, then one can in fact zero out the coordinates of $w_i$ outside the t largest coordinates
for some large enough t, and the agreement of the halfspace h only changes by a negligible amount!
Thus, we first carry out the zeroing operation for all linear functions with large critical index. 

We now describe the above construction and analysis of the dictatorship test in some more
detail. It is convenient to think of the k queries $x_1, \ldots, x_k$ as the rows of a k × R matrix with {0, 1}
entries. Henceforth, we will refer to matrices $\{0,1\}^{k \times R}$ and their rows and columns.

We construct two distributions $\mathcal{D}_0, \mathcal{D}_1$ on $\{0,1\}^k$ such that for s ∈ {0, 1}, we have
$\Pr_{x \in \mathcal{D}_s}[\bigvee_{i=1}^{k} x_i = s] \ge 1 - \epsilon/2$ for $\epsilon = o_k(1)$ (this will ensure the completeness of the reduction, i.e., certain
disjunctions pass with high probability). Further, the distributions $\mathcal{D}_0, \mathcal{D}_1$ will be carefully chosen to have
matching first four moments. This will be used in the soundness analysis where we will use an
invariance principle to infer structural properties of halfspaces that pass the test with probability
noticeably greater than 1/2.

We define the distribution $\mathcal{D}_s^R$ on matrices $\{0,1\}^{k \times R}$ by sampling R columns independently
according to $\mathcal{D}_s$, and then perturbing each bit with a small probability (on the order of ε). We define the following
test (or equivalently, distribution on examples): given a halfspace h on $\{0,1\}^{k \times R}$, with probability
1/2 we check h(x) = 0 for a sample $x \in \mathcal{D}_0^R$, and with probability 1/2 we check h(x) = 1 for a
sample $x \in \mathcal{D}_1^R$.

Completeness: By construction, each of the R disjunctions $\mathrm{OR}_j(x) = \bigvee_{i=1}^{k} x_i^{(j)}$ passes the test
with probability at least 1 − ε (here $x_i^{(j)}$ denotes the entry in the i'th row and j'th column of x).
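The shape of this test can be simulated with a short sketch. It is only a toy: the two column samplers below are hypothetical stand-ins (all zeros for s = 0, exactly one uniformly placed 1 for s = 1) and do not have the matching moments that the actual construction requires, and "perturbing" a bit is assumed to mean replacing it by a uniformly random bit with probability `noise`. The sketch only illustrates the completeness side, namely that a column disjunction is accepted with high probability while a constant function is accepted about half the time.

```python
import random

def sample_column_s0(k, rng):
    """Toy stand-in for a column drawn in the s = 0 case: all zeros."""
    return [0] * k

def sample_column_s1(k, rng):
    """Toy stand-in for a column drawn in the s = 1 case: exactly one bit set."""
    col = [0] * k
    col[rng.randrange(k)] = 1
    return col

def sample_example(k, R, noise, rng):
    """Draw (x, b): b is a fair coin, x is a k-by-R matrix with R independent columns,
    after which each bit is rerandomized with probability `noise`."""
    b = rng.randrange(2)
    sampler = sample_column_s1 if b == 1 else sample_column_s0
    cols = [sampler(k, rng) for _ in range(R)]
    x = [[cols[j][i] for j in range(R)] for i in range(k)]  # x[i][j]: row i, column j
    for i in range(k):
        for j in range(R):
            if rng.random() < noise:
                x[i][j] = rng.randrange(2)
    return x, b

def or_of_column(j):
    """The disjunction OR_j over the k entries of column j."""
    return lambda x: int(any(row[j] for row in x))

def acceptance_rate(h, k, R, noise, trials, seed=0):
    """Empirical probability that h(x) equals the label b."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        x, b = sample_example(k, R, noise, rng)
        hits += (h(x) == b)
    return hits / trials

if __name__ == "__main__":
    print(acceptance_rate(or_of_column(2), k=4, R=5, noise=0.01, trials=20000))
    print(acceptance_rate(lambda x: 0, k=4, R=5, noise=0.01, trials=20000))
```

Soundness cannot be observed with these stand-in samplers, because a halfspace summing all entries would separate them easily. Ruling that out is exactly the role of the moment-matching requirement on the two column distributions.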

Soundness: For the soundness analysis, suppose $h(x) = \mathrm{pos}(\langle w, x \rangle - \theta)$ is a halfspace that
passes the test with probability at least 1/2 + ε. The halfspace h can be written in two ways by
expanding the inner product ⟨w, x⟩ along rows and columns, i.e., $h(x) = \mathrm{pos}(\sum_{i \in [k]} \langle w_i, x_i \rangle - \theta) =
\mathrm{pos}(\sum_{j \in [R]} \langle w^{(j)}, x^{(j)} \rangle - \theta)$. Let us denote $f_i(x) = \langle w_i, x_i \rangle$.

First, let us see why the linear functions $\langle w_i, x_i \rangle$ must be close to some dictator. Note that we
need to show that two of the linear functions are close to the same dictator. 

Suppose each of the linear functions $f_i$ is not close to any dictator. In other words, for each
i, no single coordinate of the vector $w_i$ is too large (contains more than τ-fraction of the ℓ2 mass
$\|w_i\|_2$ of vector $w_i$). Clearly, this implies that no single column of the matrix w is too large.

Recall that the halfspace is given by $h(x) = \mathrm{pos}(\sum_{j \in [R]} \langle w^{(j)}, x^{(j)} \rangle - \theta)$. Here $l(x) = \sum_{j \in [R]} \langle w^{(j)}, x^{(j)} \rangle - \theta$
is a degree 1 polynomial into which we are substituting values from two product distributions $\mathcal{D}_0^R$
and $\mathcal{D}_1^R$. Further, the distributions $\mathcal{D}_0$ and $\mathcal{D}_1$ have matching moments up to order 4 by design.
Using the invariance principle, the distribution of l(x) is roughly the same, whether x is from $\mathcal{D}_0^R$
or $\mathcal{D}_1^R$. Thus, by the invariance principle, the halfspace h is unable to distinguish between the
distributions $\mathcal{D}_0^R$ and $\mathcal{D}_1^R$ with a noticeable advantage.
Further, suppose no two linear functions $f_i$ are close to the same dictator, i.e., $C_\tau(w_i) \cap C_\tau(w_j) = \emptyset$.
In this case, we condition on the values of $x_i^{(j)}$ for $j \in C_\tau(w_i)$. Since $C_\tau(w_i) \cap C_\tau(w_j) = \emptyset$, this
conditions at most one value in each column. Therefore, the conditional distributions on each column
in cases $\mathcal{D}_0$ and $\mathcal{D}_1$ still have matching first three moments. We thus apply the invariance principle
using the fact that after deleting the coordinates in $C_\tau(w_i)$, all the remaining coefficients of the
weight vector w are small (by definition of critical index). This implies that $C_\tau(w_i) \cap C_\tau(w_j) \neq \emptyset$
for some two rows i, j and finishes the proof of the soundness claim.

The above consistency-enforcing test almost immediately yields the Unique Games hardness of 
weak learning disjunctions by halfspaces via standard methods. 

2.2 Extending to NP-hardness 

To prove NP-hardness as opposed to hardness assuming the Unique Games conjecture, we reduce 
a version of Label Cover to our problem. This requires a more complicated consistency check, and 
we have to overcome several additional technical obstacles in the proof. 

The main obstacle encountered in transferring the dictatorship test to a Label Cover-based
hardness is one that commonly arises for several other problems. Specifically, the projection
constraint on an edge e = (u, v) maps a large set of labels $\mathcal{R} = \{r_1, \ldots, r_d\}$ corresponding to a vertex
u to a single label r for the vertex v. While composing the Label Cover constraint (u, v) with the
dictatorship test, all labels in $\mathcal{R}$ have to be necessarily equivalent. In several settings including this
work, this requires the coordinates corresponding to labels in $\mathcal{R}$ to be mostly identical! However, on
making the coordinates corresponding to $\mathcal{R}$ identical, the prover corresponding to u can determine
the identity of edge (u, v), thus completely destroying the soundness of the composition. In fact,
the natural extension of the Unique Games-based reduction for MaxCut [32] to a corresponding 
Label Cover hardness fails primarily for this reason. 

Unlike MaxCut or other Unique Games-based reductions, in our case, the soundness of the 
dictatorship test is required to hold against a specific class of functions, i.e., halfspaces. Harnessing
this fact, we execute the reduction starting from a Label Cover instance whose projections are
unique on average. More precisely, a smooth Label Cover (introduced in [31]) is one in which for
every vertex u, and a pair of labels r, r', the labels {r, r'} project to the same label with a tiny
probability over the choice of the edge e = (u, v). Technically, we express the error term in the
invariance principle as a certain fourth moment of the coefficients of the halfspace, and use the 
smoothness to bound this error term for most edges of the Label Cover instance. 

2.3 Bypassing the Unique Games conjecture 

Unlike previous invariance principle based proofs which are only known to give Unique-Games 
hardness, we are able to reduce from a version of the Label Cover problem, based on unique 
on average projections, that can be shown to be NP-hard. It is of great interest to find other 
applications where a weak uniqueness property like the smoothness condition mentioned above 
can be used to convert a Unique-Games hardness result to an unconditional NP-hardness result. 
Indeed, inspired by the success of this work in avoiding the UGC assumption and using some of 
our methods, follow-up work has managed to bypass the Unique Games conjecture in some optimal 
geometric inapproximability results [21]. To the best of our knowledge, the results of [21] are the 
first NP-hardness proofs showing a tight inapproximability factor that is related to fundamental 
parameters of Gaussian space, and among the small handful of results where optimality of a
non-trivial semidefinite programming based algorithm is shown under the assumption P ≠ NP. We hope
that this paper has thus opened the avenue to convert at least some of the many tight Unique-Games 
hardness results to NP-hardness results. 



3 Preliminaries 

In this section, we define two important tools in our analysis: i) critical index, ii) invariance 
principle. 

3.1 Critical Index 

The notion of critical index was first introduced by Servedio [43] and plays an important role in 
the analysis of halfspaces in [37, 40, 13]. 

Definition 3.1. Given any real vector $w = (w^{(1)}, w^{(2)}, \ldots, w^{(n)}) \in \mathbb{R}^n$, reorder the coordinates
by decreasing absolute value, i.e., $|w^{(i_1)}| \ge |w^{(i_2)}| \ge \ldots \ge |w^{(i_n)}|$, and denote $\sigma_t^2 = \sum_{j=t}^{n} |w^{(i_j)}|^2$.
For 0 ≤ τ ≤ 1, the τ-critical index of the vector w is defined to be the smallest index k such
that $|w^{(i_k)}| \le \tau \sigma_k$. If no such k exists ($\forall k$, $|w^{(i_k)}| > \tau \sigma_k$), the τ-critical index is defined to be +∞. The
vector w is said to be τ-regular if the τ-critical index is 1.
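Definition 3.1 translates directly into a short computation. The sketch below follows the 1-based convention of the definition, so a τ-regular vector has critical index 1, and returns `math.inf` when no index qualifies. The function name `critical_index` and the example vectors are illustrative.

```python
import math

def critical_index(w, tau):
    """Smallest 1-based k with |w^{(i_k)}| <= tau * sigma_k, where coordinates are sorted by
    decreasing absolute value and sigma_k^2 is the squared mass of coordinates k..n."""
    a = sorted((abs(v) for v in w), reverse=True)
    tail = [0.0] * (len(a) + 1)          # tail[k] = sum of a[j]^2 for j >= k (0-based)
    for k in range(len(a) - 1, -1, -1):
        tail[k] = tail[k + 1] + a[k] ** 2
    for k in range(len(a)):
        if a[k] <= tau * math.sqrt(tail[k]):
            return k + 1
    return math.inf

if __name__ == "__main__":
    print(critical_index([1, 1, 1, 1], 0.5))        # balanced weights: regular
    print(critical_index([100, 1, 1, 1, 1], 0.5))   # one dominant weight, then regular
    print(critical_index([8, 4, 2, 1], 0.5))        # geometric decay: never regular
```

The three examples correspond to the three regimes used in the analysis: a regular vector, a vector that becomes regular after removing a few large coordinates, and a vector whose weights decay so fast that every coordinate dominates its tail.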

A simple observation from [13] is that if the critical index of a sequence is large then the sequence 
must contain a geometrically decreasing subsequence. 

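Lemma 3.2. (Lemma 5.5 in [13]) Given a vector $w = (w^{(i)})_{i=1}^{n}$ such that $|w^{(1)}| \ge |w^{(2)}| \ge \ldots \ge
|w^{(n)}|$, if the τ-critical index of the vector w is larger than l, then for any $1 \le i < j \le l + 1$,

$$|w^{(j)}| \le \sigma_j \le \left(\sqrt{1 - \tau^2}\right)^{j-i} \sigma_i \le \left(\sqrt{1 - \tau^2}\right)^{j-i} |w^{(i)}| / \tau.$$

In particular, if $j > i + (4/\tau^2) \ln(1/\tau)$ then $|w^{(j)}| \le |w^{(i)}|/3$.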

For a T-regular weight vector, the following lemma bounds the probability that its weighted 
sum falls into a small interval under certain distributions on the points. The proof is in Appendix 
B. 

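Lemma 3.3. Let $w \in \mathbb{R}^n$ be a τ-regular vector with $\sum_i |w^{(i)}|^2 = 1$, and let $\mathcal{D}$ be a distribution over
$\{0,1\}^n$. Define a distribution $\tilde{\mathcal{D}}$ on $\{0,1\}^n$ as follows: to generate y from $\tilde{\mathcal{D}}$, first sample x from
$\mathcal{D}$ and then define,

$$y^{(i)} = \begin{cases} x^{(i)} & \text{with probability } 1 - \gamma \\ \text{random bit} & \text{with probability } \gamma. \end{cases}$$

Then for any interval [a, b], we have

$$\Pr\big[\langle w, y \rangle \in [a, b]\big] \le \frac{4|b - a|}{\sqrt{\gamma}} + \frac{4\tau}{\sqrt{\gamma}} + 2e^{-\frac{\gamma^2}{2\tau^2}}.$$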
Intuitively, by the Berry-Esseen Theorem, $\langle w, y \rangle$ is τ-close to the Gaussian distribution if each
$y^{(i)}$ is a random bit; therefore we can bound the probability that $\langle w, y \rangle$ falls into the interval [a, b].
In the above lemma, each $y^{(i)}$ has probability γ to be a random bit, so roughly a γ fraction of the $y^{(i)}$ are
random bits, and we can similarly bound the probability that $\langle w, y \rangle$ falls into the interval [a, b].

Definition 3.4. For a vector $w \in \mathbb{R}^n$, define the set of indices $H_t(w) \subseteq [n]$ as the set of indices
containing the t biggest coordinates of w by absolute value. Suppose its τ-critical index is $c_\tau$; define the
set of indices $C_\tau(w) = H_{c_\tau}(w)$. In other words, $C_\tau(w)$ is the set of indices whose deletion makes
the vector w τ-regular.

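Definition 3.5. For a vector $w \in \mathbb{R}^n$ and a subset of indices $S \subseteq [n]$, define the vector $\mathrm{Truncate}(w, S) \in \mathbb{R}^n$ as:

$$(\mathrm{Truncate}(w, S))^{(i)} = \begin{cases} w^{(i)} & \text{if } i \in S \\ 0 & \text{otherwise.} \end{cases}$$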
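The two operations defined here, taking the indices of the t largest coordinates and zeroing out everything outside a chosen index set, can be sketched as follows. The names `top_indices` and `truncate`, the 0-based indices, and the tie-breaking rule (earlier index first) are assumptions made for the example; truncation is assumed to keep the coordinates in S and set the rest to zero.

```python
def top_indices(w, t):
    """Indices (0-based) of the t largest coordinates of w by absolute value.
    Ties are broken in favor of the smaller index."""
    order = sorted(range(len(w)), key=lambda i: -abs(w[i]))
    return set(order[:t])

def truncate(w, S):
    """Keep the coordinates of w indexed by S and zero out all others."""
    return [w[i] if i in S else 0 for i in range(len(w))]

if __name__ == "__main__":
    w = [0.5, -3, 2, 0.1]
    head = top_indices(w, 2)
    print(sorted(head))        # positions of the two largest weights
    print(truncate(w, head))   # all other weights zeroed out
```

In the soundness analysis this is the operation applied to weight vectors with large critical index: only the top few coefficients are retained, and the lemmas of this section bound how much the halfspace can change as a result.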



As suggested by Lemma 3.2, a weight vector with a large critical index has a geometrically 
decreasing subsequence. The following two lemmas use this fact to bound the probability that the 
weighted sum of a geometrically decreasing sequence of weights falls into a small interval. First, 
we restate Claim 5.7 from [13] here. 

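Lemma 3.6. [Claim 5.7, [13]] Let $w = (w^{(1)}, \ldots, w^{(T)})$ be such that $|w^{(1)}| \ge |w^{(2)}| \ge \ldots \ge |w^{(T)}| > 0$
and $|w^{(i+1)}| \le |w^{(i)}|/3$ for 1 ≤ i ≤ T − 1. Then for any interval $I = [\alpha - \frac{|w^{(T)}|}{6}, \alpha + \frac{|w^{(T)}|}{6}]$ of length
$\frac{|w^{(T)}|}{3}$, there is at most one point $x \in \{0,1\}^T$ such that $\langle w, x \rangle \in I$.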
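The statement of Lemma 3.6 can be checked by brute force on small examples. The sketch below assumes the interval has half-width |w^{(T)}|/6 (total length |w^{(T)}|/3), enumerates all points of {0,1}^T, and scans interval centers over a rational grid. The helper names and the grid step are illustrative choices, and a grid scan is only a sanity check rather than a proof.

```python
from fractions import Fraction
from itertools import product

def points_in_window(w, alpha):
    """All x in {0,1}^T with <w, x> inside [alpha - |w_T|/6, alpha + |w_T|/6]."""
    half = Fraction(abs(w[-1])) / 6
    return [x for x in product((0, 1), repeat=len(w))
            if alpha - half <= sum(wi * xi for wi, xi in zip(w, x)) <= alpha + half]

def max_points_over_grid(w, step=Fraction(1, 12)):
    """Largest number of points captured by any window whose center lies on the grid."""
    lo = sum(min(wi, 0) for wi in w) - 1
    hi = sum(max(wi, 0) for wi in w) + 1
    best, alpha = 0, Fraction(lo)
    while alpha <= hi:
        best = max(best, len(points_in_window(w, alpha)))
        alpha += step
    return best

if __name__ == "__main__":
    print(max_points_over_grid([27, 9, 3, 1]))     # weights shrink by a factor of 3
    print(max_points_over_grid([27, -9, 3, -1]))   # signs do not matter
    print(max_points_over_grid([2, 1, 1]))         # decay condition violated
```

When each weight is at most a third of the previous one in absolute value, distinct 0/1 points have weighted sums that differ by more than half the smallest weight, so a window of that length isolates at most one point. The last example shows that the conclusion can fail without the decay condition.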

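Lemma 3.7. Let $w = (w^{(1)}, \ldots, w^{(T)})$ be such that $|w^{(1)}| \ge |w^{(2)}| \ge \ldots \ge |w^{(T)}| > 0$ and
$|w^{(i+1)}| \le |w^{(i)}|/3$ for 1 ≤ i ≤ T − 1. Let $\mathcal{D}$ be a distribution over $\{0,1\}^T$. Define a distribution $\tilde{\mathcal{D}}$ on $\{0,1\}^T$
as follows: To generate y from $\tilde{\mathcal{D}}$, sample x from $\mathcal{D}$ and set

$$y^{(i)} = \begin{cases} x^{(i)} & \text{with probability } 1 - \gamma \\ \text{random bit} & \text{with probability } \gamma. \end{cases}$$

Then for any $\theta \in \mathbb{R}$ we have

$$\Pr\left[\langle w, y \rangle \in \left[\theta - \frac{|w^{(T)}|}{6}, \theta + \frac{|w^{(T)}|}{6}\right]\right] \le \left(1 - \frac{\gamma}{2}\right)^{T}.$$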

Proof. By Lemma 3.6, we know that for the interval $J = [\theta - \frac{|w^{(T)}|}{6}, \theta + \frac{|w^{(T)}|}{6}]$, there is at most one
point $r \in \{0,1\}^T$ such that $\langle w, r \rangle \in J$. If no such r exists then clearly the probability is zero.
On the other hand, suppose there exists such an r, then $\langle w, y \rangle \in J$ only if $(y^{(1)}, y^{(2)}, \ldots, y^{(T)}) =
(r^{(1)}, \ldots, r^{(T)})$ holds.

Conditioned on any fixing of the bits x, every bit $y^{(i)}$ is an independent random bit with
probability γ. Therefore, for every fixing of x, for each i ∈ [T], with probability at least γ/2, $y^{(i)}$
is not equal to $r^{(i)}$. Therefore, $\Pr[y^{(1)} = r^{(1)}, y^{(2)} = r^{(2)}, \ldots, y^{(T)} = r^{(T)}] \le (1 - \frac{\gamma}{2})^{T}$. □
3.2 Invariance Principle

While invariance principles have been shown in various settings by [39, 11, 38], we restate a version of 
the principle well suited for our application. We present a self-contained proof for it in Appendix C. 

Definition 3.8. A function $\Psi(x) : \mathbb{R} \to \mathbb{R}$ for which fourth-order derivatives exist everywhere on
ℝ is said to be K-bounded if $|\Psi''''(t)| \le K$ for all $t \in \mathbb{R}$.

Definition 3.9. Two ensembles of random variables $\mathcal{P} = (p_1, \ldots, p_k)$ and $\mathcal{Q} = (q_1, \ldots, q_k)$ are said
to have matching moments up to degree d if for every multi-set S of elements from [k], |S| ≤ d, we
have $\mathbb{E}[\prod_{i \in S} p_i] = \mathbb{E}[\prod_{i \in S} q_i]$.
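For finitely supported ensembles, Definition 3.9 can be verified exhaustively. The sketch below represents a joint distribution as a dictionary from outcome tuples to exact probabilities and compares the expectations of all monomials indexed by multisets of size at most d. The representation and the function names are assumptions made for the example.

```python
from fractions import Fraction
from itertools import combinations_with_replacement, product
from math import prod

def moment(dist, S):
    """E[prod_{i in S} x_i] for a finite distribution {outcome tuple: probability}."""
    return sum(p * prod(x[i] for i in S) for x, p in dist.items())

def matching_moments(P, Q, k, d):
    """True if P and Q agree on every moment indexed by a multiset S of [k] with 1 <= |S| <= d."""
    return all(
        moment(P, S) == moment(Q, S)
        for size in range(1, d + 1)
        for S in combinations_with_replacement(range(k), size)
    )

# Uniform distribution on {0,1}^3 versus the uniform distribution on even-parity strings.
UNIFORM = {x: Fraction(1, 8) for x in product((0, 1), repeat=3)}
EVEN_PARITY = {x: Fraction(1, 4) for x in product((0, 1), repeat=3) if sum(x) % 2 == 0}

if __name__ == "__main__":
    print(matching_moments(UNIFORM, EVEN_PARITY, k=3, d=2))
    print(matching_moments(UNIFORM, EVEN_PARITY, k=3, d=3))
```

The even-parity distribution is pairwise independent with unbiased bits, so it agrees with the uniform distribution on all moments of degree at most 2, while the degree-3 moment of the product of all three bits separates them. The dictatorship test needs the analogous property for its two column distributions, up to degree 4.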

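Theorem 3.10. (Invariance Principle) Let $\mathcal{A} = \{A^{(1)}, \ldots, A^{(R)}\}$, $\mathcal{B} = \{B^{(1)}, \ldots, B^{(R)}\}$ be
families of ensembles of random variables with $A^{(i)} = \{a^{(i)}_1, \ldots, a^{(i)}_{k_i}\}$ and $B^{(i)} = \{b^{(i)}_1, \ldots, b^{(i)}_{k_i}\}$, satisfying
the following properties:

• For each i ∈ [R], the random variables in ensembles $(A^{(i)}, B^{(i)})$ have matching moments up
to degree 3. Further all the random variables in $\mathcal{A}$ and $\mathcal{B}$ are bounded by 1.

• The ensembles $A^{(i)}$ are all independent of each other, similarly the ensembles $B^{(i)}$ are
independent of each other.

Given a set of vectors $l = \{l^{(1)}, \ldots, l^{(R)}\}$ with $l^{(i)} \in \mathbb{R}^{k_i}$, define the linear function
$l : \mathbb{R}^{k_1} \times \cdots \times \mathbb{R}^{k_R} \to \mathbb{R}$ as

$$l(x) = \sum_{i \in [R]} \langle l^{(i)}, x^{(i)} \rangle.$$

Then for a K-bounded function $\Psi : \mathbb{R} \to \mathbb{R}$ we have

$$\left| \mathbb{E}_{\mathcal{A}}\big[\Psi(l(\mathcal{A}) - \theta)\big] - \mathbb{E}_{\mathcal{B}}\big[\Psi(l(\mathcal{B}) - \theta)\big] \right| \le K \sum_{i \in [R]} \|l^{(i)}\|_1^4$$

for all θ > 0. Further, define the spread function c(α) corresponding to the ensembles $\mathcal{A}, \mathcal{B}$ and the
linear function l as follows,

(Spread Function:) For 1/2 > α > 0, let

$$c(\alpha) = \max\left( \sup_{\theta} \Pr_{\mathcal{A}}\big[l(\mathcal{A}) \in [\theta - \alpha, \theta + \alpha]\big],\ \sup_{\theta} \Pr_{\mathcal{B}}\big[l(\mathcal{B}) \in [\theta - \alpha, \theta + \alpha]\big] \right),$$

then for all θ,

$$\left| \mathbb{E}_{\mathcal{A}}\big[\mathrm{pos}(l(\mathcal{A}) - \theta)\big] - \mathbb{E}_{\mathcal{B}}\big[\mathrm{pos}(l(\mathcal{B}) - \theta)\big] \right| \le O\!\left(\frac{1}{\alpha^4}\right) \sum_{i \in [R]} \|l^{(i)}\|_1^4 + 2c(\alpha).$$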

Roughly speaking, the second part of the theorem states that the pos function can be thought of
as $O(1/\alpha^4)$-bounded with error parameter $c(\alpha)$.

4 Construction of the Dictatorship Test

In this section we describe the construction of the dictatorship test, which will be the key ingredient
in the hardness reduction from k-Unique Label Cover.

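4.1 Distributions $\mathcal{D}_0$ and $\mathcal{D}_1$

The dictatorship test is based on the following two distributions $\mathcal{D}_0$ and $\mathcal{D}_1$ defined on $\{0,1\}^k$.

Lemma 4.1. For $k \in \mathbb{N}$, there exist two probability distributions $\mathcal{D}_0, \mathcal{D}_1$ on $\{0,1\}^k$ such that for
$x = (x_1, \ldots, x_k)$,

$$\Pr_{x \sim \mathcal{D}_0}[\text{every } x_i \text{ is } 0] \ge 1 - \frac{2}{\sqrt{k}} \quad \text{and} \quad \Pr_{x \sim \mathcal{D}_1}[\text{every } x_i \text{ is } 0] \le \frac{1}{\sqrt{k}},$$

while matching moments up to degree 4, i.e., for all $i, j, m, n \in [k]$,

$$\mathbb{E}_{\mathcal{D}_0}[x_i x_j x_m x_n] = \mathbb{E}_{\mathcal{D}_1}[x_i x_j x_m x_n].$$

Proof. For $\epsilon = \frac{1}{\sqrt{k}}$, take $\mathcal{D}_1$ to be the following distribution:

1. with probability $(1 - \epsilon)$, randomly set exactly one of the bits to be 1 and all the others to be 0;

2. with probability $\epsilon/4$, independently set every bit to be 1 with probability $\frac{1}{k^{1/3}}$;

3. with probability $\epsilon/4$, independently set every bit to be 1 with probability $\frac{2}{k^{1/3}}$;

4. with probability $\epsilon/4$, independently set every bit to be 1 with probability $\frac{3}{k^{1/3}}$;

5. with probability $\epsilon/4$, independently set every bit to be 1 with probability $\frac{4}{k^{1/3}}$.

The distribution $\mathcal{D}_0$ is defined to be the following distribution, with parameters $\epsilon_1, \epsilon_2, \epsilon_3, \epsilon_4$ to be
specified later:

1. with probability $1 - (\epsilon_1 + \epsilon_2 + \epsilon_3 + \epsilon_4)$, set every bit to be zero;

2. with probability $\epsilon_1$, independently set every bit to be 1 with probability $\frac{1}{k^{1/3}}$;

3. with probability $\epsilon_2$, independently set every bit to be 1 with probability $\frac{2}{k^{1/3}}$;

4. with probability $\epsilon_3$, independently set every bit to be 1 with probability $\frac{3}{k^{1/3}}$;

5. with probability $\epsilon_4$, independently set every bit to be 1 with probability $\frac{4}{k^{1/3}}$.

From the definition of $\mathcal{D}_0, \mathcal{D}_1$, we know that $\Pr_{x \sim \mathcal{D}_0}[\text{every } x_i \text{ is } 0] \ge 1 - (\epsilon_1 + \epsilon_2 + \epsilon_3 + \epsilon_4)$ and
$\Pr_{x \sim \mathcal{D}_1}[\text{every } x_i \text{ is } 0] \le \epsilon = \frac{1}{\sqrt{k}}$.

It remains to determine each $\epsilon_i$. Notice that the moment matching conditions can be expressed
as a linear system over the parameters $\epsilon_1, \epsilon_2, \epsilon_3, \epsilon_4$ as follows:

$$\sum_{i=1}^{4} \epsilon_i \frac{i}{k^{1/3}} = \frac{1-\epsilon}{k} + \sum_{i=1}^{4} \frac{\epsilon}{4} \cdot \frac{i}{k^{1/3}},$$

$$\sum_{i=1}^{4} \epsilon_i \left(\frac{i}{k^{1/3}}\right)^{d} = \sum_{i=1}^{4} \frac{\epsilon}{4} \left(\frac{i}{k^{1/3}}\right)^{d} \quad \text{for } d = 2, 3, 4.$$

We then show that this linear system has a feasible solution with $\epsilon_1, \epsilon_2, \epsilon_3, \epsilon_4 \ge 0$ and $\sum_{i=1}^{4} \epsilon_i \le 2/\sqrt{k}$.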

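To prove this, apply Cramer's rule. Writing $b_1 = \frac{1-\epsilon}{k} + \sum_{i=1}^{4} \frac{\epsilon}{4} \cdot \frac{i}{k^{1/3}}$ and $b_d = \sum_{i=1}^{4} \frac{\epsilon}{4} \left(\frac{i}{k^{1/3}}\right)^d$ ($d = 2, 3, 4$) for the right-hand sides of the linear system, we have

$$\epsilon_1 = \frac{\begin{vmatrix} b_1 & \frac{2}{k^{1/3}} & \frac{3}{k^{1/3}} & \frac{4}{k^{1/3}} \\ b_2 & \frac{4}{k^{2/3}} & \frac{9}{k^{2/3}} & \frac{16}{k^{2/3}} \\ b_3 & \frac{8}{k} & \frac{27}{k} & \frac{64}{k} \\ b_4 & \frac{16}{k^{4/3}} & \frac{81}{k^{4/3}} & \frac{256}{k^{4/3}} \end{vmatrix}}{\begin{vmatrix} \frac{1}{k^{1/3}} & \frac{2}{k^{1/3}} & \frac{3}{k^{1/3}} & \frac{4}{k^{1/3}} \\ \frac{1}{k^{2/3}} & \frac{4}{k^{2/3}} & \frac{9}{k^{2/3}} & \frac{16}{k^{2/3}} \\ \frac{1}{k} & \frac{8}{k} & \frac{27}{k} & \frac{64}{k} \\ \frac{1}{k^{4/3}} & \frac{16}{k^{4/3}} & \frac{81}{k^{4/3}} & \frac{256}{k^{4/3}} \end{vmatrix}}.$$

With some calculation using basic linear algebra, we get

$$\epsilon_1 = \frac{\epsilon}{4} + \frac{\begin{vmatrix} \frac{1-\epsilon}{k} & \frac{2}{k^{1/3}} & \frac{3}{k^{1/3}} & \frac{4}{k^{1/3}} \\ 0 & \frac{4}{k^{2/3}} & \frac{9}{k^{2/3}} & \frac{16}{k^{2/3}} \\ 0 & \frac{8}{k} & \frac{27}{k} & \frac{64}{k} \\ 0 & \frac{16}{k^{4/3}} & \frac{81}{k^{4/3}} & \frac{256}{k^{4/3}} \end{vmatrix}}{\begin{vmatrix} \frac{1}{k^{1/3}} & \frac{2}{k^{1/3}} & \frac{3}{k^{1/3}} & \frac{4}{k^{1/3}} \\ \frac{1}{k^{2/3}} & \frac{4}{k^{2/3}} & \frac{9}{k^{2/3}} & \frac{16}{k^{2/3}} \\ \frac{1}{k} & \frac{8}{k} & \frac{27}{k} & \frac{64}{k} \\ \frac{1}{k^{4/3}} & \frac{16}{k^{4/3}} & \frac{81}{k^{4/3}} & \frac{256}{k^{4/3}} \end{vmatrix}} = \frac{1}{4\sqrt{k}} + O\!\left(\frac{1}{k^{2/3}}\right).$$

For large enough $k$, we have $0 \le \epsilon_1 \le \frac{1}{2\sqrt{k}}$. By a similar calculation, we can bound $\epsilon_2, \epsilon_3, \epsilon_4$ by $\frac{1}{2\sqrt{k}}$.
Overall, we have $\epsilon_1 + \epsilon_2 + \epsilon_3 + \epsilon_4 \le 2/\sqrt{k}$. □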
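
The linear system for $\epsilon_1, \ldots, \epsilon_4$ can be solved exactly for a concrete $k$. The sketch below is a sanity check under stated assumptions, not part of the proof: it takes $k = m^6$ so that $\epsilon = 1/\sqrt{k}$ and $1/k^{1/3}$ are rational, it assumes (as the Cramer's-rule expression indicates) that $\mathcal{D}_1$ puts weight $\epsilon/4$ on each of its four product components, and the function names are hypothetical.

```python
from fractions import Fraction

def solve(A, b):
    """Gauss-Jordan elimination over exact rationals."""
    n = len(A)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for c in range(n):
        p = next(r for r in range(c, n) if M[r][c] != 0)
        M[c], M[p] = M[p], M[c]
        M[c] = [v / M[c][c] for v in M[c]]
        for r in range(n):
            if r != c:
                f = M[r][c]
                M[r] = [vr - f * vc for vr, vc in zip(M[r], M[c])]
    return [M[r][n] for r in range(n)]

def moment_d1(eps, s, k, d):
    """E over D_1 of a product of d distinct coordinates (s = k^(-1/3))."""
    one_hot = (1 - eps) / k if d == 1 else 0
    return one_hot + sum(eps / 4 * (i * s) ** d for i in range(1, 5))

def moment_d0(weights, s, d):
    """E over D_0 of a product of d distinct coordinates."""
    return sum(e * (i * s) ** d for i, e in enumerate(weights, start=1))

def d0_weights(m):
    """Solve the moment-matching system for k = m**6."""
    k, eps, s = m ** 6, Fraction(1, m ** 3), Fraction(1, m ** 2)
    A = [[(i * s) ** d for i in range(1, 5)] for d in range(1, 5)]
    b = [moment_d1(eps, s, k, d) for d in range(1, 5)]
    return k, eps, s, solve(A, b)

k, eps, s, weights = d0_weights(16)  # k = 16**6, eps = 1/4096
```

For this choice of $k$ the solution is nonnegative and sums to less than $2/\sqrt{k}$. Much smaller values of $k$ can give a negative $\epsilon_2$, which is consistent with the "large enough $k$" qualifier in the proof.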

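We define a "noisy" version of $\mathcal{D}_b$ ($b \in \{0,1\}$) below.

Definition 4.2. For $b \in \{0,1\}$, define the distribution $\tilde{\mathcal{D}}_b$ on $\{0,1\}^k$ as follows:

• First generate $x \in \{0,1\}^k$ according to $\mathcal{D}_b$.

• For each $i \in [k]$,

$$y_i = \begin{cases} x_i & \text{with probability } 1 - \frac{1}{k^2}, \\ \text{uniform random bit } u_i & \text{with probability } \frac{1}{k^2}. \end{cases}$$

Observation 4.3. $\tilde{\mathcal{D}}_0$ and $\tilde{\mathcal{D}}_1$ also have matching moments up to degree 4.

Proof. Write $\gamma = \frac{1}{k^2}$ for the noise rate. Since the noise is defined to be an independent uniform random bit, when calculating
moments of $y$, such as $\mathbb{E}[y_{i_1} y_{i_2} \cdots y_{i_d}]$, we can substitute $y_i$ by $(1 - \gamma)x_i + \frac{1}{2}\gamma$. Therefore, a degree
$d$ moment of $y$ can be expressed as a weighted sum of moments of $x$ of degree up to $d$. Since
$\mathcal{D}_0$ and $\mathcal{D}_1$ have matching moments up to degree 4, it follows that $\tilde{\mathcal{D}}_0$ and $\tilde{\mathcal{D}}_1$ also have the same
property. □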
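
The effect of the noise on moments can be verified exactly on a toy pair of distributions. This is a sketch under assumptions: the pair below (uniform on even-weight and on odd-weight strings of $\{0,1\}^3$, matching up to degree 2) stands in for $\mathcal{D}_0, \mathcal{D}_1$, and `add_noise` is a hypothetical helper that computes the exact law of $y$.

```python
from fractions import Fraction
from itertools import combinations_with_replacement, product

def moment(dist, S):
    """E[prod_{i in S} x_i] for a distribution given as {point: probability}."""
    total = Fraction(0)
    for x, pr in dist.items():
        for i in S:
            pr = pr * x[i]
        total += pr
    return total

def match_up_to(P, Q, k, d):
    """True if all moments over multi-sets of size <= d agree."""
    return all(moment(P, S) == moment(Q, S)
               for size in range(1, d + 1)
               for S in combinations_with_replacement(range(k), size))

def add_noise(dist, gamma):
    """Exact law of y when each bit of x is replaced, independently with
    probability gamma, by a uniform random bit."""
    out = {}
    for x, pr in dist.items():
        for y in product((0, 1), repeat=len(x)):
            w = pr
            for xi, yi in zip(x, y):
                # y_i equals x_i with probability (1 - gamma) + gamma / 2.
                w *= (1 - gamma / 2) if xi == yi else gamma / 2
            out[y] = out.get(y, Fraction(0)) + w
    return out

cube = list(product((0, 1), repeat=3))
EVEN = {x: Fraction(1, 4) for x in cube if sum(x) % 2 == 0}
ODD = {x: Fraction(1, 4) for x in cube if sum(x) % 2 == 1}
gamma = Fraction(1, 9)  # 1/k^2 with k = 3
NOISY_EVEN, NOISY_ODD = add_noise(EVEN, gamma), add_noise(ODD, gamma)
```

The first moment of a noisy bit is $(1-\gamma)\,\mathbb{E}[x_i] + \gamma/2$, and the matching degree of the pair is unchanged by the noise, in line with the substitution argument above.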

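The following simple lemma asserts that conditioning the two distributions $\mathcal{D}_0$ and $\mathcal{D}_1$ on the
same coordinate $x_j$ being fixed to a value $c$ results in conditional distributions that still have matching
moments up to degree 3.

Lemma 4.4. Given two distributions $\mathcal{D}_0, \mathcal{D}_1$ on $\{0,1\}^k$ with matching moments up to degree $d$, for
any multi-set $S$ of elements from $[k]$ with $|S| \le d - 1$, any $j \in [k]$ and any $c \in \{0,1\}$,

$$\mathbb{E}_{\mathcal{D}_0}\Big[\prod_{i \in S} x_i \,\Big|\, x_j = c\Big] = \mathbb{E}_{\mathcal{D}_1}\Big[\prod_{i \in S} x_i \,\Big|\, x_j = c\Big].$$

Proof. For the case $c = 1$ and any $b \in \{0,1\}$,

$$\mathbb{E}_{\mathcal{D}_b}\Big[x_j \prod_{i \in S} x_i\Big] = \mathbb{E}_{\mathcal{D}_b}\Big[\prod_{i \in S} x_i \,\Big|\, x_j = 1\Big] \Pr_{\mathcal{D}_b}[x_j = 1] = \mathbb{E}_{\mathcal{D}_b}\Big[\prod_{i \in S} x_i \,\Big|\, x_j = 1\Big]\, \mathbb{E}_{\mathcal{D}_b}[x_j].$$

Therefore,

$$\mathbb{E}_{\mathcal{D}_0}\Big[\prod_{i \in S} x_i \,\Big|\, x_j = 1\Big] = \frac{\mathbb{E}_{\mathcal{D}_0}[x_j \prod_{i \in S} x_i]}{\mathbb{E}_{\mathcal{D}_0}[x_j]} = \frac{\mathbb{E}_{\mathcal{D}_1}[x_j \prod_{i \in S} x_i]}{\mathbb{E}_{\mathcal{D}_1}[x_j]} = \mathbb{E}_{\mathcal{D}_1}\Big[\prod_{i \in S} x_i \,\Big|\, x_j = 1\Big],$$

where the middle equality uses matching moments up to degree $d$, since the numerator is a moment of degree at most $d$.

For the case $c = 0$, replace $x_j$ with $x'_j = 1 - x_j$. It is easy to see that $\mathcal{D}_0$ and $\mathcal{D}_1$ still have
matching moments, and conditioning on $x_j = 0$ is the same as conditioning on $x'_j = 1$. Hence we
can reduce to the case $c = 1$. □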
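
The loss of one degree under conditioning can be seen exactly on a toy pair. In this sketch the two distributions (uniform on even-weight and on odd-weight strings of $\{0,1\}^3$, which match up to degree $d = 2$) and the helper names are illustrative assumptions, not the paper's $\mathcal{D}_0, \mathcal{D}_1$.

```python
from fractions import Fraction
from itertools import combinations_with_replacement, product

def moment(dist, S):
    """E[prod_{i in S} x_i] for a distribution given as {point: probability}."""
    total = Fraction(0)
    for x, pr in dist.items():
        for i in S:
            pr = pr * x[i]
        total += pr
    return total

def match_up_to(P, Q, k, d):
    """True if all moments over multi-sets of size <= d agree."""
    return all(moment(P, S) == moment(Q, S)
               for size in range(1, d + 1)
               for S in combinations_with_replacement(range(k), size))

def condition(dist, j, c):
    """Conditional distribution given x_j = c."""
    mass = sum(pr for x, pr in dist.items() if x[j] == c)
    return {x: pr / mass for x, pr in dist.items() if x[j] == c}

cube = list(product((0, 1), repeat=3))
EVEN = {x: Fraction(1, 4) for x in cube if sum(x) % 2 == 0}
ODD = {x: Fraction(1, 4) for x in cube if sum(x) % 2 == 1}
```

After conditioning on any single coordinate, the pair still matches up to degree $d - 1 = 1$ but no longer at degree 2, so the bound $|S| \le d - 1$ in Lemma 4.4 cannot be improved for this example.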

4.2 The Dictatorship Test

Let $R$ be a positive integer. Based on the distributions $\mathcal{D}_0$ and $\mathcal{D}_1$, we define the dictatorship test
as follows:

1. Generate a random bit $b \in \{0,1\}$.

2. Generate $x \in \{0,1\}^{kR}$ (which is also written as $\{x_i^{(j)}\}_{i \in [k], j \in [R]}$) from $\mathcal{D}_b^R$.

3. For each $i \in [k]$, $j \in [R]$, set

$$y_i^{(j)} = \begin{cases} x_i^{(j)} & \text{with probability } 1 - \frac{1}{k^2}, \\ \text{random bit} & \text{with probability } \frac{1}{k^2}. \end{cases}$$

4. Output the labelled example $(y, b)$. Equivalently, if $h$ denotes the halfspace, ACCEPT
if $h(y) = b$.

We can also view $y$ as being generated as follows: i) with probability $\frac{1}{2}$, generate a negative
sample from distribution $\tilde{\mathcal{D}}_0^R$; ii) with probability $\frac{1}{2}$, generate a positive sample from distribution $\tilde{\mathcal{D}}_1^R$.
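
The four steps of the test translate directly into a sampler. The following is a minimal sketch under loud assumptions: `sample_column` is a hypothetical callback that draws one block $(x_1^{(j)}, \ldots, x_k^{(j)})$ from $\mathcal{D}_b$, and the stand-in used here (all zeros for $b = 0$, a uniformly random one-hot vector for $b = 1$) is a crude simplification that does not have matching moments.

```python
import random

def sample_test(sample_column, R, gamma, rng):
    """One labelled example (y, b); y[j][i] plays the role of y_i^{(j)}."""
    b = rng.randrange(2)                                  # step 1
    x = [sample_column(b, rng) for _ in range(R)]         # step 2
    y = [[xi if rng.random() >= gamma else rng.randrange(2) for xi in col]
         for col in x]                                    # step 3
    return y, b                                           # step 4

def dictator(j):
    """The disjunction of the k bits in block j."""
    return lambda y: int(any(y[j]))

def toy_column(k):
    """Stand-in for D_b (assumption): zeros if b = 0, one-hot if b = 1."""
    def sample(b, rng):
        col = [0] * k
        if b == 1:
            col[rng.randrange(k)] = 1
        return col
    return sample

rng = random.Random(0)
examples = [sample_test(toy_column(5), R=4, gamma=0.0, rng=rng)
            for _ in range(200)]
```

With the noise switched off and this stand-in, every dictator disjunction labels every example correctly. The real test gives up a small amount of completeness to the noise and to the non-dictator components of $\mathcal{D}_0$ and $\mathcal{D}_1$.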

The dictatorship test has the following completeness and soundness properties. 
Theorem 4.5. (completeness) For any j G [R], h{y) = V^^^yp^ passes with probability ^ 1 — 

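Theorem 4.6. (soundness) Fix $\tau = \frac{1}{k^7}$ and $t = \frac{1}{\tau^2}(3\ln(1/\tau) + \ln R) + \lceil 4k^2 \ln k \rceil \lceil \frac{4}{\tau^2} \ln(1/\tau) \rceil$. Let
$h(y) = \mathrm{pos}(\langle w, y\rangle - \theta)$ be a halfspace such that $H_t(w_i) \cap H_t(w_j) = \emptyset$ for all distinct $i, j \in [k]$. Then the
halfspace $h(y)$ passes the dictatorship test with probability at most $\frac{1}{2} + O(\frac{1}{k})$.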

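Proof. (Theorem 4.5) If $x$ is generated from $\mathcal{D}_0^R$, we know that with probability at least $1 - \frac{2}{\sqrt{k}}$ all
the bits in $\{x_1^{(j)}, x_2^{(j)}, \ldots, x_k^{(j)}\}$ are set to 0. Each of the $k$ bits is hit by noise with probability $\frac{1}{k^2}$, so by a union bound, with probability at least $1 - \frac{2}{\sqrt{k}} - \frac{1}{k}$,
$\{y_1^{(j)}, y_2^{(j)}, \ldots, y_k^{(j)}\}$ are all set to 0, in which case the test passes as $\vee_{i=1}^{k} y_i^{(j)} = 0$. If $x$ is generated
from $\mathcal{D}_1^R$, we know that with probability at least $1 - \frac{1}{\sqrt{k}}$, one of the bits in $\{x_1^{(j)}, x_2^{(j)}, \ldots, x_k^{(j)}\}$ is set
to 1, and by a union bound one of $\{y_1^{(j)}, y_2^{(j)}, \ldots, y_k^{(j)}\}$ is set to 1 with probability at least $1 - \frac{1}{\sqrt{k}} - \frac{1}{k}$,
in which case the test passes since $\vee_{i=1}^{k} y_i^{(j)} = 1$. Overall, averaging over the two values of $b$, the test passes with probability at least
$1 - \frac{3}{2\sqrt{k}} - \frac{1}{k}$. □
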
4.3 Proof of Soundness (Theorem 4.6) 

We will prove the contrapositive statement of Theorem 4.6: if some $h(y)$ passes the above dictatorship
test with high probability, then we can decode, for each $w_i$ ($i \in [k]$), a small list of coordinates,
and at least two of the lists will intersect.

The proof is based on two key lemmas (Lemmas 4.7, 4.8). The first lemma states that if a
halfspace passes the test with good probability, then two of its critical index sets $C_\tau(w_i), C_\tau(w_j)$
must intersect. This would immediately imply Theorem 4.6 if $C_\tau$ is less than $t$. The second lemma
states that every halfspace can be approximated by another halfspace with critical index less than
$t$, so we can assume that $C_\tau$ is small without loss of generality.

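Let $h(y)$ be a halfspace function on $\{0,1\}^{kR}$ given by $h(y) = \mathrm{pos}(\langle w, y\rangle - \theta)$. Equivalently,
$h(y)$ can be written as

$$h(y) = \mathrm{pos}\Big(\sum_{j \in [R]} \langle w^{(j)}, y^{(j)}\rangle - \theta\Big) = \mathrm{pos}\Big(\sum_{i \in [k]} \langle w_i, y_i\rangle - \theta\Big),$$

where $w^{(j)} \in \mathbb{R}^k$ and $w_i \in \mathbb{R}^R$.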

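Lemma 4.7. (Common Influential Coordinates) For $\tau = \frac{1}{k^7}$, let $h(y)$ be a halfspace such that for
all $i \ne j \in [k]$, we have $C_\tau(w_i) \cap C_\tau(w_j) = \emptyset$. Then

$$\left| \mathbb{E}_{\tilde{\mathcal{D}}_0^R}[h(y)] - \mathbb{E}_{\tilde{\mathcal{D}}_1^R}[h(y)] \right| \le O\!\left(\frac{1}{k}\right).$$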



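Proof. Fix the following notation:

$$s_i = \mathrm{Truncate}(w_i, C_\tau(w_i)), \qquad l_i = w_i - s_i, \qquad y_i^C = \mathrm{Truncate}(y_i, C_\tau(w_i)),$$

$$s = s_1, s_2, \ldots, s_k, \qquad l = l_1, l_2, \ldots, l_k, \qquad y^C = y_1^C, y_2^C, \ldots, y_k^C.$$

We can rewrite the halfspace $h(y)$ as $h(y) = \mathrm{pos}(\langle s, y^C\rangle + \langle l, y\rangle - \theta)$. Let us first normalize the
halfspace $h(y)$ so that $\sum_{i \in [k]} \|l_i\|_2^2 = 1$. We now condition on a possible fixing of the vector $y^C$.
Under this conditioning and for $y$ chosen randomly from the distribution $\tilde{\mathcal{D}}_0^R$, define the family of
ensembles $\mathcal{A} = A^{(1)}, \ldots, A^{(R)}$ as follows:

$$A^{(j)} = \{ y_i^{(j)} \mid i \in [k] \text{ for which } j \notin C_\tau(w_i) \}.$$

Similarly define the ensembles $\mathcal{B} = B^{(1)}, \ldots, B^{(R)}$ using $y$ chosen randomly from the distribution
$\tilde{\mathcal{D}}_1^R$. Further, let us denote $l^{(j)} = (l_1^{(j)}, \ldots, l_k^{(j)})$. Now we apply the invariance principle (Theorem
3.10) to the ensembles $\mathcal{A}, \mathcal{B}$ and the linear function $l$. For each $j \in [R]$, there is at most one
coordinate $i \in [k]$ such that $j \in C_\tau(w_i)$. Thus, conditioning on $y^C$ amounts to fixing at most
one variable $y_i^{(j)}$ in each column $\{y_i^{(j)}\}_{i \in [k]}$. By Lemma 4.4, since $\tilde{\mathcal{D}}_0$ and $\tilde{\mathcal{D}}_1$ have matching moments
up to degree 4, we get that $A^{(j)}$ and $B^{(j)}$ have matching moments up to degree 3. Also notice
that $\max_{j \in [R]} |l_i^{(j)}| \le \tau \|l_i\|_2 \le \tau \|l\|_2$ (as $l_i$ is $\tau$-regular) and each $y_i^{(j)}$ is set to be a random
unbiased bit with probability $\frac{1}{k^2}$; by Lemma 3.3, the linear function $l$ and the ensembles $\mathcal{A}, \mathcal{B}$
satisfy the following spread property for every $\theta' \in \mathbb{R}$:

$$\Pr_{\mathcal{A}}\big[l(\mathcal{A}) \in [\theta' - \alpha, \theta' + \alpha]\big] \le c(\alpha),$$
$$\Pr_{\mathcal{B}}\big[l(\mathcal{B}) \in [\theta' - \alpha, \theta' + \alpha]\big] \le c(\alpha),$$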
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1 -J 

where c{a) ^ 8ak + 4rA; + 2e (by setting 7 = and |6 — a| = 2a in Lemma 3.3). Using the 

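invariance principle (Theorem 3.10), with $\theta' = \theta - \langle s, y^C\rangle$, this implies:

$$\left| \mathbb{E}_{\mathcal{A}}\big[\mathrm{pos}(l(\mathcal{A}) - \theta')\big] - \mathbb{E}_{\mathcal{B}}\big[\mathrm{pos}(l(\mathcal{B}) - \theta')\big] \right| \le O\!\left(\frac{1}{\alpha^4}\right) \sum_{j \in [R]} \|l^{(j)}\|_1^4 + 2c(\alpha). \qquad (1)$$

By definition of the critical index, we have $\max_{j \in [R]} |l_i^{(j)}| \le \tau \|l_i\|_2$. Using this, we can bound
$\sum_{j \in [R]} \|l^{(j)}\|_1^4$ as follows:

$$\sum_{j \in [R]} \|l^{(j)}\|_1^4 \le k^3 \sum_{i \in [k]} \sum_{j \in [R]} |l_i^{(j)}|^4 \le k^3 \sum_{i \in [k]} \Big(\max_{j \in [R]} |l_i^{(j)}|^2\Big) \|l_i\|_2^2 \le k^3 \tau^2 \sum_{i \in [k]} \|l_i\|_2^4 \le k^3 \tau^2 \|l\|_2^4 \le \frac{1}{k^{11}}.$$

In the final inequality of the above calculation, we used the fact that $\tau = \frac{1}{k^7}$ and $\|l\|_2 = 1$. Let us
choose $\alpha = \frac{1}{k^2}$; then (1) is bounded by $O(1/k)$ for all settings of $y^C$. Averaging over all
settings of $y^C$ we get that

$$\left| \mathbb{E}_{\tilde{\mathcal{D}}_0^R}[h(y)] - \mathbb{E}_{\tilde{\mathcal{D}}_1^R}[h(y)] \right| \le O\!\left(\frac{1}{k}\right). \qquad □$$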



The above lemma asserts that unless some two vectors $w_i, w_j$ have a common influential
coordinate, the halfspace $h(y)$ cannot distinguish between $\tilde{\mathcal{D}}_0^R$ and $\tilde{\mathcal{D}}_1^R$. Unlike with the traditional
notion of influence, it is unclear whether the number of coordinates in $C_\tau(w_i)$ is small. The following
lemma yields a way to get around this.

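Lemma 4.8. (Bounding the number of influential coordinates) Let $t$ be set as in Theorem 4.6.
Given a halfspace $h(y)$ and $r \in [k]$ such that $|C_\tau(w_r)| > t$, define $\tilde{h}(y) = \mathrm{pos}(\sum_{i \in [k]} \langle \tilde{w}_i, y_i\rangle - \theta)$
as follows: $\tilde{w}_r = \mathrm{Truncate}(w_r, H_t(w_r))$ and $\tilde{w}_i = w_i$ for all $i \ne r$. Then,

$$\left| \mathbb{E}_{\tilde{\mathcal{D}}_0^R}[\tilde{h}(y)] - \mathbb{E}_{\tilde{\mathcal{D}}_0^R}[h(y)] \right| \le \frac{1}{k^2} \quad \text{and} \quad \left| \mathbb{E}_{\tilde{\mathcal{D}}_1^R}[\tilde{h}(y)] - \mathbb{E}_{\tilde{\mathcal{D}}_1^R}[h(y)] \right| \le \frac{1}{k^2}.$$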



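Proof. Without loss of generality, we assume $r = 1$ and $|w_1^{(1)}| \ge |w_1^{(2)}| \ge \cdots \ge |w_1^{(R)}|$. In particular,
this implies $H_t(w_1) = \{1, \ldots, t\}$. Set $T = \lceil 4k^2 \ln k \rceil$. Define the subset $G$ of $H_t(w_1)$ as

$$G = \{ g_i \mid g_i = 1 + i \lceil (4/\tau^2) \ln(1/\tau) \rceil,\ 1 \le i \le T \}.$$

Therefore, by Lemma 3.2, $|w_1^{(g_i)}|$ is a geometrically decreasing sequence such that $|w_1^{(g_{i+1})}| \le |w_1^{(g_i)}|/3$. Let $H = H_t(w_1) \setminus G$. Fix the following notation:

$$w_1^G = \mathrm{Truncate}(w_1, G), \qquad w_1^H = \mathrm{Truncate}(w_1, H), \qquad w_1^{>t} = \mathrm{Truncate}(w_1, \{t+1, \ldots, n\}).$$

Similarly, define the vectors $y_1^G, y_1^H, y_1^{>t}$. We now rewrite the halfspace functions $h(y)$ and $\tilde{h}(y)$
as:

$$h(y) = \mathrm{pos}\Big(\sum_{i=2}^{k} \langle w_i, y_i\rangle + \langle w_1^G, y_1^G\rangle + \langle w_1^H, y_1^H\rangle + \langle w_1^{>t}, y_1^{>t}\rangle - \theta\Big),$$

$$\tilde{h}(y) = \mathrm{pos}\Big(\sum_{i=2}^{k} \langle w_i, y_i\rangle + \langle w_1^G, y_1^G\rangle + \langle w_1^H, y_1^H\rangle - \theta\Big).$$

Notice that for any $y$, $h(y) \ne \tilde{h}(y)$ implies

$$\Big| \sum_{i=2}^{k} \langle w_i, y_i\rangle + \langle w_1^G, y_1^G\rangle + \langle w_1^H, y_1^H\rangle - \theta \Big| \le \big| \langle w_1^{>t}, y_1^{>t}\rangle \big|.$$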

By Lemma 3.2, we know that 

^2 



(9t)|2 



(1 - r2)*-5T 



2N-^(31n(l/r)+lnR) 



(1-t2)T^ 



1^1 II2 



(2) 



T 



Using the fact that i?||iuf*||2 ^ we can get that ^ y/T\w'f'^\ ^ Combin- 

ing the above inequality with (2) we see that, 



Pr 

'^0 



< Pr [1 ^(m„y,) + ,yf ) + («^f ,yf ) - ^1 ^ ^ 



{9t)| 



Pr [(tx;f,t/f) G \e' 



where $\theta' = -\sum_{i=2}^{k} \langle w_i, y_i\rangle - \langle w_1^H, y_1^H\rangle + \theta$. Any fixing of the value of $\theta' \in \mathbb{R}$ induces a
certain distribution on $y_1^G$. However, the $\frac{1}{k^2}$ noise introduced in $y_1^G$ is completely independent.
This corresponds to the setting of Lemma 3.7, and hence we can bound the above probability by
$(1 - \frac{1}{2k^2})^T \le \frac{1}{k^2}$. The result follows from averaging over all values of $\theta'$.
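
The last numerical step can be checked directly. The sketch below assumes $T = \lceil 4k^2 \ln k \rceil$ and noise rate $\gamma = 1/k^2$, as in Theorem 4.6 and Definition 4.2; the function name is hypothetical.

```python
import math

def noise_bound(k):
    """Return T = ceil(4 k^2 ln k) and the bound (1 - 1/(2 k^2))^T."""
    T = math.ceil(4 * k * k * math.log(k))
    return T, (1 - 1 / (2 * k * k)) ** T

# (1 - x)^T <= exp(-x T) <= exp(-2 ln k) = 1/k^2 with x = 1/(2 k^2).
for k in (2, 5, 10, 100):
    T, bound = noise_bound(k)
    assert bound <= 1 / k ** 2
```

The choice of $T$ is what drives the bound below $1/k^2$, which is the per-replacement error that Lemma 4.8 can afford when summed over up to $k$ replacements.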

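□

With the two lemmas above, we now prove the soundness property.

Proof. (Theorem 4.6) The probability of success of $h$ is given by $\frac{1}{2} + \frac{1}{2}\big(\mathbb{E}_{\tilde{\mathcal{D}}_1^R}[h(y)] - \mathbb{E}_{\tilde{\mathcal{D}}_0^R}[h(y)]\big)$.
Therefore, it suffices to show that $\big|\mathbb{E}_{\tilde{\mathcal{D}}_1^R}[h(y)] - \mathbb{E}_{\tilde{\mathcal{D}}_0^R}[h(y)]\big| = O(\frac{1}{k})$.

Define $I = \{ r \mid C_\tau(w_r) > t \}$. We discuss the following two cases.

1. $I = \emptyset$; i.e., for all $i \in [k]$, $C_\tau(w_i) \le t$. Then for all $i \ne j$, $H_t(w_i) \cap H_t(w_j) = \emptyset$ implies $C_\tau(w_i) \cap C_\tau(w_j) = \emptyset$.
By Lemma 4.7, we thus have

$$\left| \mathbb{E}_{\tilde{\mathcal{D}}_0^R}[h(y)] - \mathbb{E}_{\tilde{\mathcal{D}}_1^R}[h(y)] \right| = O\!\left(\frac{1}{k}\right).$$

1. Sample an edge $e = (v_1, \ldots, v_k) \in E$.

2. Generate a random bit $b \in \{0,1\}$.

3. Sample $x \in \{0,1\}^{kR}$ from $\tilde{\mathcal{D}}_b^R$.

4. Define $y \in \{0,1\}^{|V| \times R}$ as follows:

(a) For each $v \notin \{v_1, \ldots, v_k\}$, $y_v = 0$.

(b) For each $i \in [k]$ and $j \in [R]$, $y_{v_i}^{(j)} = x_i^{(\pi^{v_i,e}(j))}$.

5. Output the example $(y, b)$.

Figure 1: Reduction from k-Unique Label Cover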

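2. $I \ne \emptyset$. Then for all $r \in I$, we set $\tilde{w}_r = \mathrm{Truncate}(w_r, H_t(w_r))$ and replace $w_r$ with $\tilde{w}_r$ in $h$ to
get a new halfspace $h'$. Since such replacements occur at most $k$ times and, by Lemma 4.8, every
replacement changes the output of the halfspace on at most a $\frac{1}{k^2}$ fraction of examples, we can bound
the overall change by $k \times \frac{1}{k^2} = \frac{1}{k}$. That is,

$$\left| \mathbb{E}_{\tilde{\mathcal{D}}_0^R}[h'(y)] - \mathbb{E}_{\tilde{\mathcal{D}}_0^R}[h(y)] \right| \le \frac{1}{k}, \qquad \left| \mathbb{E}_{\tilde{\mathcal{D}}_1^R}[h'(y)] - \mathbb{E}_{\tilde{\mathcal{D}}_1^R}[h(y)] \right| \le \frac{1}{k}. \qquad (3)$$

Also notice that for $h'$ and all $r \in [k]$, the critical index of $\tilde{w}_r$ (i.e., $|C_\tau(\tilde{w}_r)|$) is less than $t$. This
reduces the problem to Case 1, and we conclude $\big|\mathbb{E}_{\tilde{\mathcal{D}}_0^R}[h'(y)] - \mathbb{E}_{\tilde{\mathcal{D}}_1^R}[h'(y)]\big| = O(1/k)$. Along with
(3) this finishes the proof of Theorem 4.6.

□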

4.4 Reduction from k-Unique Label Cover

With the dictatorship test defined, we now briefly describe a reduction from the k-Unique Label
Cover problem to agnostic learning of monomials, thus showing Theorem 1.1 under the Unique
Games Conjecture (Conjecture 2.2). Although our final hardness result only assumes P ≠ NP, we
describe the reduction from k-Unique Label Cover to illustrate the main idea of
our proof.

Let $\mathcal{L}(G(V, E), R, R, \{\pi^{v,e} \mid v \in V, e \in E\})$ be an instance of k-Unique Label Cover. The
reduction is defined in Figure 1. It produces a distribution over labeled examples $(y, b)$, where
$y \in \{0,1\}^{|V| \times R}$ and the label $b \in \{0,1\}$. We index the coordinates of $y \in \{0,1\}^{|V| \times R}$ by $y_w^{(i)}$ (for
$w \in V$, $i \in [R]$) and denote by $y_w$ (for $w \in V$) the vector $(y_w^{(1)}, y_w^{(2)}, \ldots, y_w^{(R)})$.



Proof of Theorem 1.1 assuming Unique Games Conjecture Fix k = r] = and a 



1 

positive integer R > [(2/c)''^] for which Conjecture 2.2 holds. 
Completeness: Suppose that $\Lambda : V \to [R]$ is a labeling that strongly satisfies a $1 - k\eta$ fraction
of the edges. Consider the disjunction $h(y) = \bigvee_{v \in V} y_v^{(\Lambda(v))}$. For at least a $1 - k\eta$ fraction of edges
$e = (v_1, v_2, \ldots, v_k) \in E$, $\pi^{v_1,e}(\Lambda(v_1)) = \cdots = \pi^{v_k,e}(\Lambda(v_k)) = r$. Let us fix such a choice of edge $e$ in
step 1. As all coordinates of $y$ outside of $\{y_{v_1}, \ldots, y_{v_k}\}$ are set to 0 in step 4(a), the disjunction
reduces to $\bigvee_{i \in [k]} y_{v_i}^{(\Lambda(v_i))} = \bigvee_{i \in [k]} x_i^{(r)}$. By Theorem 4.5, such a disjunction agrees with every $(y, b)$
with probability at least 1 — Therefore h{y) agrees with a random example with probability 
at least (1 - ^)(1 - A;r?) ^ 1 - ^ - fcr/ ^ 1 - e. 

Soundness: Suppose there exists a halfspace h{y) = X^^,gv-(''A'u, Z/t;) that agrees with more than 
i + e ^ i + ^ fraction of the examples. Set t = /c^''(31n(A;'^) + Ini?) + lAk^^lnk'^] • lAk^lnk] = 

0[k^^ Ini?) (same as in Theorem 4.6). Define the labeling A using the following strategy : for each 
vertex v £ V randomly pick a label from Ht{'W^). 

By an averaging argument, for at least | fraction of the edges e £ E generated in step 1 of the 
reduction, h{y) agrees with the examples corresponding to e with probability at least 5 + f • We 
will refer to such edges as good. By Theorem 4.6 for each good edge e £ E, there exists i,j G [k], 
such that 7r'"^''^[Ht{wi,J) n ■K'"^'^(^Ht{wy^)') / 0. Therefore the edge e € i? is weakly satisfied by the 
labeling A with probability at least p-. Hence, in expectation the labeling A weakly satisfies at least 

I • = 0( ^33 j^ ) ^ -^ji fraction of the edges (by the choice of R and t). 
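
The randomized decoding step can be illustrated by exact enumeration. In this sketch the candidate lists and projections are hypothetical toy data, and `agree_probability` is an assumed helper name: it computes the probability that independent uniform picks from two lists project to the same label.

```python
from fractions import Fraction

def agree_probability(list_i, list_j, pi_i, pi_j):
    """Pr over uniform a in list_i, b in list_j that pi_i[a] == pi_j[b]."""
    hits = sum(1 for a in list_i for b in list_j if pi_i[a] == pi_j[b])
    return Fraction(hits, len(list_i) * len(list_j))

# Toy data (assumption): two lists of size t = 3 whose projections share
# exactly one common value.
t = 3
list_i, list_j = [0, 1, 2], [3, 4, 5]
pi_i = {0: 0, 1: 1, 2: 2}
pi_j = {3: 2, 4: 7, 5: 8}
p = agree_probability(list_i, list_j, pi_i, pi_j)
```

Whenever the projected lists intersect, at least one of the $t^2$ pairs is consistent, so the edge is weakly satisfied with probability at least $1/t^2$. This is the only property of the lists that the soundness argument uses.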

5 Reduction from Label Cover 

In this section, we describe a reduction from k-Label Cover with an additional smoothness
property to the problem of agnostic learning of disjunctions by halfspaces. This will give us Theorem
1.1 without assuming the Unique Games Conjecture.

5.1 Smooth k-Label Cover

Our reduction uses the following hardness result for k-Label Cover (Definition 2.1) with the
additional smoothness property.

Theorem 5.1. There exists a constant $\gamma > 0$ such that for any integer parameters $J, u \ge 1$, it is NP-hard
to distinguish between the following two types of k-Label Cover instances $\mathcal{L}(G(V, E), M, N, \{\pi^{v,e} \mid e \in E, v \in e\})$
with $M = 7^{(J+1)u}$ and $N = 2^u 7^{Ju}$:

1. (Strongly satisfiable instances) There is some labeling that strongly satisfies every hyperedge.

2. (Instances that are not $2k^2 2^{-\gamma u}$-weakly satisfiable) There is no labeling that weakly satisfies
at least a $2k^2 2^{-\gamma u}$ fraction of the hyperedges.

In addition, the k-Label Cover instances have the following properties:

• (Smoothness) For a fixed vertex $v$ and a randomly picked hyperedge $e$ containing $v$,

$$\forall i \ne j \in [M], \quad \Pr[\pi^{v,e}(i) = \pi^{v,e}(j)] \le 1/J.$$

• For any mapping $\pi^{v,e}$ and any number $i \in [N]$, we have $|(\pi^{v,e})^{-1}(i)| \le d = 4^u$; i.e., there are
at most $d = 4^u$ elements in $[M]$ that are mapped to the same number in $[N]$.

The proof of the above theorem can be found in Appendix D.

• Pick a hyperedge $e = (v_1, v_2, \ldots, v_k) \in E$ with corresponding projections $\pi^{v_1,e}, \ldots, \pi^{v_k,e} : [M] \to [N]$.

• Generate a random bit $b \in \{0,1\}$.

• Sample $x \in \{0,1\}^{kN}$ from $\mathcal{D}_b^N$.

• Generate $y \in \{0,1\}^{|V| \times M}$ as follows:

1. For each $v \notin e$, $y_v = 0$.

2. For each $i \in [k]$, set $y_{v_i} \in \{0,1\}^M$ as follows:

$$y_{v_i}^{(j)} = \begin{cases} x_i^{(\pi^{v_i,e}(j))} & \text{with probability } 1 - \frac{1}{k^2}, \\ \text{random bit} & \text{with probability } \frac{1}{k^2}. \end{cases}$$

• Output the example $(y, b)$ or, equivalently, ACCEPT if $h(y) = b$.

Figure 2: Reduction from k-Label Cover

In the rest of the paper, we will set u = k and therefore d = 4^. Also we set the smoothness 
parameter J = d^"^ = 4^^^^. 

5.2 Reduction from Smooth k-Label Cover

The starting point is a smooth k-Label Cover instance $\mathcal{L}(G(V, E), M, N, \{\pi^{v,e} \mid e \in E, v \in e\})$ with
$M = 7^{(J+1)u}$ and $N = 2^u 7^{Ju}$ as described in Theorem 5.1. Figure 2 illustrates the reduction that,
given an instance $\mathcal{L}$ of k-Label Cover, produces a random labeled example. We refer to the
obtained distribution on examples as $\mathcal{E}$.

5.3 Proof of Theorem 1.1 

We claim that our reduction has the following completeness and soundness properties. 

Theorem 5.2. • Completeness: If $\mathcal{L}$ is a strongly satisfiable instance of smooth k-Label
Cover, then there is a disjunction that agrees with a random example from $\mathcal{E}$ with probability
at least $1 - O(\frac{1}{\sqrt{k}})$.

• Soundness: If C is not 2k'^2~"'^ -weakly satisfiable and is smooth with parameters J = A^"^^ 
and d = 4^, then there is no half space that agrees with a random example from £ with 
probability more than ^ +0(-^). 
Combining the above theorem with Theorem 5.1 and taking $k = O(1/\epsilon^2)$, we obtain our
main result, Theorem 1.1.

It remains to check the correctness of the completeness and soundness claims in Theorem 5.2. 
First let us prove the completeness property. 

Proof. (Proof of Completeness) Let $\Lambda$ be the labeling that strongly satisfies $\mathcal{L}$. Consider the disjunction
$h(y) = \bigvee_{v \in V} y_v^{(\Lambda(v))}$. Let $e = (v_1, v_2, \ldots, v_k)$ be any hyperedge and let $\mathcal{E}_e$ be the distribution $\mathcal{E}$
restricted to the examples generated for $e$. With probability at least $1 - 1/k$, $y_{v_i}^{(\Lambda(v_i))} = x_i^{(\pi^{v_i,e}(\Lambda(v_i)))}$ for
every $i \in [k]$. As $e$ is strongly satisfied by $\Lambda$, for all $i, j \in [k]$, $\pi^{v_i,e}(\Lambda(v_i)) = \pi^{v_j,e}(\Lambda(v_j))$. Therefore,
as in the proof of Theorem 4.5, we obtain that $h(y)$ agrees with a random example from $\mathcal{E}_e$ with
probability at least $1 - O(1/\sqrt{k})$. The labeling $\Lambda$ strongly satisfies all edges, and therefore
$h(y)$ agrees with a random example from $\mathcal{E}$ with probability at least $1 - O(1/\sqrt{k})$. □

The more complicated part is the soundness property which we prove in Section 5.4. 
5.4 Soundness Analysis 

Proof Idea. The main idea is similar to the proof of Theorem 4.6, although it is more technically
involved. Notice that the reduction in Figure 2 produces examples such that $y_v^{(j_1)}, y_v^{(j_2)}$ are "almost
identical" copies when $\pi^{v,e}(j_1) = \pi^{v,e}(j_2)$. Further, for different edges $e$, the coordinates of $y$ will
be grouped in different ways, such that each group will have almost identical copies.

To handle these additional complications, the first step of the proof is to show that almost all 
the hyperedges in smooth /c-Label Cover satisfy a certain "niceness" property. After that we 
generalize the proofs of Lemma 4.7 and Lemma 4.8 under the weaker assumption that most of the 
hyperedges are "nice". 

The formal definition of "niceness" and the proof that most of the edges are "nice" appear in 
Section 5.4.1. The generalization of Lemma 4.7 appears in Section 5.4.2. The generalization of 
Lemma 4.8 appears in Section 5.4.3. All these results are put together into a proof of Theorem 5.2 
in Section 5.4.4. 

5.4.1 Most of the edges are "nice" 

Let h{y) be a halfspace that agrees with more than ^ + -^-fraction of the examples. Suppose, 

$$h(y) = \mathrm{pos}\Big(\sum_{v \in V} \langle w_v, y_v\rangle - \theta\Big).$$

Let $\tau = \frac{1}{k^{13}}$ and let

$$s_v = \mathrm{Truncate}(w_v, C_\tau(w_v)), \qquad l_v = w_v - s_v.$$

Definition 5.3. A vertex $v \in V$ is said to be $\beta$-nice with respect to a hyperedge $e \in E$ containing
it if

$$\sum_{i \in [N]} \Big( \sum_{j \in \pi^{-1}(i)} |l_v^{(j)}| \Big)^4 \le \beta \|l_v\|_2^4,$$

where $\pi : [M] \to [N]$ is the projection associated with vertex $v$ and hyperedge $e$. A hyperedge
$e = (v_1, v_2, \ldots, v_k)$ is $\beta$-nice if for every $i \in [k]$, the vertex $v_i$ is $\beta$-nice with respect to $e$.

Lemma 5.4. The fraction of $2\tau$-nice hyperedges in $E$ is at least $1 - O(1/k)$.

Proof. By definition, we know that l^, is r-regular vector. Denote I^, = \i \ |jf~ip" ^ By 
definition |/| ^ d^ . Notice there are at most d^^ pairs of values in / x /. By the smoothness 
property of the A:-Label Cover instance, for any vertex u, at least 1 — ^ fraction of the hyperedges 
incident on v have the following property: for any i,j G J^,, 7r*'''^(i) ^ t^^'^U)- If all the vertices in 
a hyperedge have this property we call it a good hyperedge. By an averaging argument, we know 
that among all hyperedges at least 1 — = 1 — ^^1 — O(^) fraction is good. 

We will show all these good hyperedges are also 2T-nice. For a given good hyperedge e, a vertex 
G e, vr = tt'"''^ and i G [A^l, there is at most one j G vr" (i) such that Vf-p- ^ -m- 

\\lv\\2 " 

Based on the above property, we will show 

i£[N] j£n-'^(i) 



Notice that 

E( E m"=Y. E mi^'^i^-m (4) 

and the sum of all the terms with ji = j2 = j's = is HitjUl. 

For all other terms \li''^hi''^hi^^hi''*'^ | with Ji, j2; jsj J4 that are not all equal, there is at least one 
(r G [4]) smaller than Therefore, can be bounded by 

Overall, expression (4) can be bounded by 

^'^^Il't'll2 + E ^^'^''^^^ (since |vr"-^(z)| < d, each I'^p appears at most 4d^ times) 

je[M] 

^(r^ + 4— )||/i,||2 is T-regular vector, so ^ t"||'i)||2 for all j G [M] ) 

^ 27" I I ly I I 2 • 



□ 
Let us fix a $2\tau$-nice hyperedge $e = (v_1, \ldots, v_k)$. As before, let $\mathcal{E}_e$ denote the distribution on
examples restricted to those generated for hyperedge $e$. We will analyze the probability that the
halfspace $h(y)$ agrees with a random example from $\mathcal{E}_e$.

Let $\pi^{v_1,e}, \pi^{v_2,e}, \ldots, \pi^{v_k,e} : [M] \to [N]$ denote the projections associated with the hyperedge $e$.
For the sake of brevity, we shall write $w_i, y_i, l_i$ instead of $w_{v_i}, y_{v_i}, l_{v_i}$. For all $j \in [N]$ and $i \in [k]$,
define

$$y_i^{(j)} = \mathrm{Truncate}(y_{v_i}, (\pi^{v_i,e})^{-1}(j)).$$

Similarly, define the vectors $w_i^{(j)}$, $l_i^{(j)}$ and $s_i^{(j)}$.

Notice that for every example $(y, b)$ in the support of $\mathcal{E}_e$, $y_v = 0$ for every vertex $v \notin e$.
Therefore, on restricting to examples from $\mathcal{E}_e$ we can write:

$$h(y) = \mathrm{pos}\Big(\sum_{i \in [k]} \langle w_i, y_i\rangle - \theta\Big).$$



5.4.2 Common Influential Variables (generalization of Lemma 4.7) 

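Lemma 5.5. Let $h(y)$ be a halfspace such that for all $i \ne j \in [k]$, we have $\pi^{v_i,e}(C_\tau(w_i)) \cap
\pi^{v_j,e}(C_\tau(w_j)) = \emptyset$. Then

$$\left| \mathbb{E}_{\mathcal{E}_e}[h(y) \mid b = 0] - \mathbb{E}_{\mathcal{E}_e}[h(y) \mid b = 1] \right| \le O\!\left(\frac{1}{k}\right). \qquad (5)$$

Proof. Fix the following notation:

$$y_i^C = \mathrm{Truncate}(y_i, C_\tau(w_i)), \qquad y^C = y_1^C, y_2^C, \ldots, y_k^C,$$

$$s = s_1, s_2, \ldots, s_k, \qquad l = l_1, l_2, \ldots, l_k.$$

We can rewrite the halfspace $h(y)$ as $h(y) = \mathrm{pos}(\langle s, y^C\rangle + \langle l, y\rangle - \theta)$. Let us first normalize the
weights of $h(y)$ so that $\sum_{i \in [k]} \|l_i\|_2^2 = 1$. Let us condition on a possible fixing of the vector $y^C$.
Under this conditioning and also for $b = 0$, define the family of ensembles $\mathcal{A} = A^{(1)}, \ldots, A^{(N)}$ as
follows:

$$A^{(j)} = \{ y_i^{(r)} \mid i \in [k],\ r \in [M] \text{ such that } \pi^{v_i,e}(r) = j \text{ and } r \notin C_\tau(w_i) \}.$$

Similarly define the ensembles $\mathcal{B} = B^{(1)}, \ldots, B^{(N)}$ for the conditioning $b = 1$. Now we shall apply
the invariance principle (Theorem 3.10) to the ensembles $\mathcal{A}, \mathcal{B}$ and the linear function $l(y)$:

$$l(y) = \sum_{j \in [N]} \langle l^{(j)}, y^{(j)}\rangle.$$

As we prove in Claim 5.6 below, the ensembles $\mathcal{A}, \mathcal{B}$ have matching moments up to degree 3.
Furthermore, by Lemma 3.3, the linear function $l$ and the ensembles $\mathcal{A}, \mathcal{B}$ satisfy the following
spread property:

$$\Pr_{\mathcal{A}}\big[l(\mathcal{A}) \in [\theta' - \alpha, \theta' + \alpha]\big] \le c(\alpha),$$
$$\Pr_{\mathcal{B}}\big[l(\mathcal{B}) \in [\theta' - \alpha, \theta' + \alpha]\big] \le c(\alpha)$$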
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for all 6' G M, where c(a) = Safe + 4rfc + 2e zPT^ (by setting 7 = and |6 — a| = 2a in Lemma 
3.3). 

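Using the invariance principle (Theorem 3.10), this implies:

$$\left| \mathbb{E}_{\mathcal{A}}\big[\mathrm{pos}(\langle s, y^C\rangle + l(\mathcal{A}) - \theta)\big] - \mathbb{E}_{\mathcal{B}}\big[\mathrm{pos}(\langle s, y^C\rangle + l(\mathcal{B}) - \theta)\big] \right| \le O\!\left(\frac{1}{\alpha^4}\right) \sum_{j \in [N]} \|l^{(j)}\|_1^4 + 2c(\alpha). \qquad (6)$$

Take $\alpha$ to be $\frac{1}{k^2}$ and recall that $\tau = \frac{1}{k^{13}}$. In Claim 5.7 below we show that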

\\l^'^^\\^2Tk\ 

i6[JV] 

The above inequality holds for an arbitrary conditioning of the values of $y^C$. Hence, by averaging
over all settings of $y^C$, we prove (5). □

Claim 5.6. The ensembles $\mathcal{A}$ and $\mathcal{B}$ have matching moments up to degree 3.

Proof. Let us suppose for a moment that $y$ was generated by setting $y_{v_i}^{(j)} = x_i^{(\pi^{v_i,e}(j))}$, that is, without
adding any noise. By Lemma 4.1, the first four moments of the random variable $y$ conditioned on
$b = 0$ agree with the first four moments of the random variable $y$ conditioned on $b = 1$. As we showed in
Observation 4.3, even with noise, the first four moments of $y$ remain the same when conditioned
on $b = 0$ and $b = 1$. Finally, $\pi^{v_i,e}(C_\tau(w_i)) \cap \pi^{v_j,e}(C_\tau(w_j)) = \emptyset$ for all $i \ne j \in [k]$. Hence for each
$j \in [N]$, conditioning on $y^C$ fixes bits in at most one row of $A^{(j)}$. Formally, for every $j \in [N]$,
there exists at most one $i \in [k]$ such that $y_i^{(j)}$ and $y^C$ have shared variables. Therefore, by Lemma
4.4, $\mathcal{A}$ and $\mathcal{B}$ have matching moments up to degree 3. □

Claim 5.7. 

i6[JV] 

Proof. Since |K-'^||i = H^P^Hii we can write 

j£[N] j&N ie[k] i&[k] j£[N] 



As e = {vi, . . . ,Vk) is a 2r-nice hyperedge, we have X^jg[7v] ^ 2r||Zj||2. By normalization of 

Ae[k] IKII2 



Z, we know J2ip\k] IKII2 — 1- Substituting this into inequality (7) we get the claimed bound. □ 



5.4.3 Bounding the Number of Influential Coordinates (generalization of Lemma 4.8) 

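Lemma 5.8. Given a halfspace $h(y) = \mathrm{pos}(\sum_{i \in [k]} \langle w_i, y_i\rangle - \theta)$ and $r \in [k]$ such that $|C_\tau(w_r)| > t$
for $t = \frac{1}{\tau^2}\big(\lceil 4k^2 \ln(2k) \rceil \lceil 4\ln(1/\tau) \rceil + \ln(1/\tau) + 10 \ln d\big) = O(k^{29})$, define $\tilde{h}(y) = \mathrm{pos}(\sum_{i \in [k]} \langle \tilde{w}_i, y_i\rangle - \tilde{\theta})$
as follows:

$\tilde{w}_r = \mathrm{Truncate}(w_r, H_t(w_r))$ and $\tilde{w}_i = w_i$ for all $i \ne r$;
$\tilde{\theta} = \theta - \mathbb{E}[\langle a_r, y_r\rangle \mid b = 0]$, for $a = w - \tilde{w}$.

Then,

$$\left| \mathbb{E}_{\mathcal{E}_e}[\tilde{h}(y) \mid b = 0] - \mathbb{E}_{\mathcal{E}_e}[h(y) \mid b = 0] \right| \le \frac{1}{k^2}, \qquad \left| \mathbb{E}_{\mathcal{E}_e}[\tilde{h}(y) \mid b = 1] - \mathbb{E}_{\mathcal{E}_e}[h(y) \mid b = 1] \right| \le \frac{1}{k^2}.$$

Proof. It is easy to see that the matching moments condition implies that

$$\mathbb{E}_{\mathcal{E}_e}[\langle a_r, y_r\rangle \mid b = 0] = \mathbb{E}_{\mathcal{E}_e}[\langle a_r, y_r\rangle \mid b = 1].$$

Let us show the inequality for the case $b = 0$; the other inequality can be derived in an identical
way. Let $\mathcal{E}_{e,0}$ denote the distribution $\mathcal{E}_e$ conditioned on $b = 0$. Without loss of generality, we may
assume that $r = 1$ and $|w_1^{(1)}| \ge |w_1^{(2)}| \ge \cdots$. In particular, this implies $H_t(w_1) = \{1, \ldots, t\}$.

Define

$$\mu_1 = \mathbb{E}_{\mathcal{E}_{e,0}}[\langle a_1, y_1\rangle], \qquad \mu_1^{(i)} = \mathbb{E}_{\mathcal{E}_{e,0}}[\langle a_1^{(i)}, y_1^{(i)}\rangle].$$

Let us set $T = \lceil 4k^2 \ln(2k) \rceil$ and define the subset $G = \{g_1, \ldots, g_T\}$ of $H_t(w_1)$ as follows:

$$G = \{ g_i \mid g_i = 1 + i \lceil (4/\tau^2) \ln(1/\tau) \rceil,\ 1 \le i \le T \}.$$

Therefore, by Lemma 3.2, $|w_1^{(g_i)}|$ is a geometrically decreasing sequence such that $|w_1^{(g_{i+1})}| \le |w_1^{(g_i)}|/3$. Let $H = H_t(w_1) \setminus G$. Fix the following notation:

$$w_1^G = \mathrm{Truncate}(w_1, G), \qquad w_1^H = \mathrm{Truncate}(w_1, H), \qquad w_1^{>t} = \mathrm{Truncate}(w_1, \{t+1, \ldots, n\}).$$

Similarly, define the vectors $y_1^G, y_1^H, y_1^{>t}$. By definition, we have $a_1 = w_1^{>t}$. Rewriting the halfspace
functions $h(y), \tilde{h}(y)$:

$$h(y) = \mathrm{pos}\Big(\sum_{i=2}^{k} \langle w_i, y_i\rangle + \langle w_1^G, y_1^G\rangle + \langle w_1^H, y_1^H\rangle + \langle a_1, y_1^{>t}\rangle - \theta\Big),$$

$$\tilde{h}(y) = \mathrm{pos}\Big(\sum_{i=2}^{k} \langle w_i, y_i\rangle + \langle w_1^G, y_1^G\rangle + \langle w_1^H, y_1^H\rangle + \mu_1 - \theta\Big).$$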

By Claim 5.9 below, with probability at most ^ = we have | (ai, j/i)— ^i| ^ (i^||ai||2. Suppose 



{ai,yi) — fill < d'^\\ai\\2, then Claim 5.10 below gives \{ai,yi) — < l/d^\w[^'^^\ < 



3 1^1 I 



Thus, we can write 



Pr^,,, h{y) / h{y) ^ Pr^^,„ {w^,yY) G W 



w 



{9t)\ 



+ 



4A; ■ 



where $\theta' = -\sum_{i=2}^{k} \langle w_i, y_i\rangle - \langle w_1^H, y_1^H\rangle - \mu_1 + \theta$. Any fixing of the value of $\theta' \in \mathbb{R}$ induces
a certain distribution on $y_1^G$. However, the $\frac{1}{k^2}$ noise introduced in $y_1^G$ is completely independent.
This corresponds to the setting of Lemma 3.7, and hence we can bound the above probability by
(1 - l/(2A;2))r + 1/4^ ^ (1 - l/(2A;2))4A:2ln(2fc) ^ i/4fc ^ 1/^2^ □ 
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Claim 5.9. 



{ai,yi) - d \\ai\\2 



d' 



Proof. Write [M] as the union of disjoint sets i?i U ii2 U • • • U Rn where Ri = {tt'"^'^) ^(z). Notice 
every Ri has size at most d, therefore 

Var£,,„((ai,yi)) = ^ Var^,,„ ((af , )) ^ d\\a^^\\l = d\\a^\\l 

is [AT] i(i[N] 

By applying Chebyshev's inequality (Th. A. 3), we have 

Pr£,,o [K«i>?/i) - Mil ^ f^llailb] ^ ^ ^ ^• 

□ 



Claim 5.10. By the choice of the parameters T and t, 

1 



n W ^ ^ |„,,(9t)| 



Proof. By Lemma 3.2, 

' ,(9t)|2 - 



\WY"\' ^ 



(1 _ r2)i-9T) 



|ai||2 ^ 



r 



(l-r2) 



2A-T(l°(VT)+101nd) 



|_ ||2 ^ ^10||_ ||2 

|ai II2 ^ M ||ai 1I2. 



□ 



5.4.4 Proof of Soundness 

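Recall that we chose $\tau = \frac{1}{k^{13}}$ and $t = O(k^{29})$.

Lemma 5.11. Fix a hyperedge $e$ which is $2\tau$-nice. If for all $i \ne j \in [k]$, $\pi^{v_i,e}(H_t(w_i)) \cap
\pi^{v_j,e}(H_t(w_j)) = \emptyset$, then the probability that the halfspace $h(y)$ agrees with a random example from
$\mathcal{E}_e$ is at most $\frac{1}{2} + O(\frac{1}{k})$.

Proof. The proof is similar to the proof of Theorem 4.6. Define $I = \{ r \mid C_\tau(w_r) > t \}$. We divide
the problem into the following two cases.

1. $I = \emptyset$; i.e., for all $i \in [k]$, $C_\tau(w_i) \le t$. Then for any $i \ne j \in [k]$, $H_t(w_i) \cap H_t(w_j) = \emptyset$ implies
$C_\tau(w_i) \cap C_\tau(w_j) = \emptyset$. By Lemma 5.5, we have

$$\left| \mathbb{E}_{\mathcal{E}_e}[h(y) \mid b = 0] - \mathbb{E}_{\mathcal{E}_e}[h(y) \mid b = 1] \right| \le O\!\left(\frac{1}{k}\right).$$

2. $I \ne \emptyset$. Then for all $r \in I$, we set $\tilde{w}_r = \mathrm{Truncate}(w_r, H_t(w_r))$ and define a new halfspace $h'$ by
replacing $w_r$ with $\tilde{w}_r$ in $h$. Since such replacements occur at most $k$ times and, by Lemma
5.8, every replacement changes the output of the halfspace on at most a $\frac{1}{k^2}$ fraction of examples
from $\mathcal{E}_e$, we can bound the overall change by $k \times \frac{1}{k^2} = \frac{1}{k}$. That is,

$$\left| \mathbb{E}_{\mathcal{E}_{e,0}}[h'(y)] - \mathbb{E}_{\mathcal{E}_{e,0}}[h(y)] \right| \le \frac{1}{k}, \qquad \left| \mathbb{E}_{\mathcal{E}_{e,1}}[h'(y)] - \mathbb{E}_{\mathcal{E}_{e,1}}[h(y)] \right| \le \frac{1}{k}. \qquad (8)$$

For the halfspace $h'$ and for all $r \in [k]$, we have $|C_\tau(\tilde{w}_r)| \le t$, thus reducing to Case 1.
Therefore

$$\left| \mathbb{E}_{\mathcal{E}_{e,0}}[h'(y)] - \mathbb{E}_{\mathcal{E}_{e,1}}[h'(y)] \right| \le O\!\left(\frac{1}{k}\right). \qquad (9)$$

Combining (8) and (9), we get

$$\left| \mathbb{E}_{\mathcal{E}_{e,0}}[h(y)] - \mathbb{E}_{\mathcal{E}_{e,1}}[h(y)] \right| \le O\!\left(\frac{1}{k}\right).$$

In other words, the probability that the halfspace $h(y)$ agrees with a random example from $\mathcal{E}_e$ is at
most $\frac{1}{2} + O(\frac{1}{k})$. □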



We first recall the soundness statement: 

Proposition 5.12. If C is not a 2k'^2~^^ -weakly satisfiable instance of smooth /c-Label Cover, 
then there is no halfspace that agrees with a random example from £ with probability more than 



+ 



Proof. The proof is by contradiction. We can define the following labeling strategy: for each vertex 
V, uniformly randomly pick a label from Ht{wy). We know that the size of Ht{wyJ is t = 0{k'^^). 

Suppose there exists a halfspace that agrees with a random example from £ with probability 
more than ^ + Then by an averaging argument, for at least ^^-fraction of the hyperedges e, 

h{y) agrees with a random example from £e with probability at least \ + We refer to these 
edges as good. 

Since there is at most 0(l/A;)-fraction of the hyperedges that are not 2T-nice we know that 
at least ^^-fraction of the hyperedges are 2T-nice and good. By Lemma 5.11, for each 2T-nice 
and good hyperedge $e$ there exist two vertices $v_i, v_j \in e$ such that $\pi^{v_i,e}(H_t(w_i))$ and $\pi^{v_j,e}(H_t(w_j))$
intersect. Then with probability at least $\frac{1}{t^2}$ the labeling strategy we defined will weakly satisfy
hyperedge $e$.

Overall this strategy is expected to weakly satisfy at least -^j^ji = ^{^) fraction of the 
hyperedges. This is a contradiction since C is not ||^-weakly satisfiable. □ 
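
The randomized labeling used in this proof can be illustrated with a small Python sketch. It assumes that $H_t(w)$ denotes the set of the $t$ coordinates of $w$ with largest magnitude; the helper names `top_t` and `agreement_probability` and the toy data are hypothetical and only meant to show why two intersecting projected sets give agreement probability at least $1/t^2$.

```python
from itertools import product

def top_t(w, t):
    """Indices of the t largest-magnitude coordinates of w (assumed reading of H_t(w))."""
    order = sorted(range(len(w)), key=lambda i: -abs(w[i]))
    return set(order[:t])

def agreement_probability(h_i, h_j, proj_i, proj_j):
    """Exact probability that labels drawn independently and uniformly from h_i and h_j
    are mapped to the same value by the projections proj_i and proj_j."""
    hits = sum(1 for a, b in product(h_i, h_j) if proj_i[a] == proj_j[b])
    return hits / (len(h_i) * len(h_j))

# Toy data (hypothetical): two weight vectors over 4 labels, projections onto 2 labels.
w_i = [0.9, 0.1, -0.8, 0.05]
w_j = [0.1, 0.7, 0.2, -0.6]
proj_i = [0, 0, 1, 1]
proj_j = [1, 0, 0, 1]
t = 2

h_i, h_j = top_t(w_i, t), top_t(w_j, t)
p = agreement_probability(h_i, h_j, proj_i, proj_j)
print(h_i, h_j, p)
```

Whenever the two projected sets share a value, at least one of the $t^2$ label pairs agrees, which is where the $1/t^2$ lower bound comes from.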
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Appendix 

A Probabilistic Inequalities 

In the discussion below we will make use of the following well-known inequalities. 

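Theorem A.1. (Hoeffding's Inequality) Let $x^{(1)}, \ldots, x^{(n)}$ be independent real random variables
such that $x^{(i)} \in [a^{(i)}, b^{(i)}]$. Then the sum of these variables $S = \sum_{i=1}^{n} x^{(i)}$ satisfies

$$\Pr\big[|S - \mathbb{E}[S]| \geq nt\big] \leq 2 \exp\left(-\frac{2n^2t^2}{\sum_{i=1}^{n} (b^{(i)} - a^{(i)})^2}\right).$$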



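Theorem A.2. (Berry-Esseen Theorem) Let $x_1, x_2, \ldots, x_n$ be i.i.d. unbiased $\{-1, 1\}$ random variables.
Also assume that $\sum_{i=1}^{n} c_i^2 = 1$ and $\max_i\{|c_i|\} \leq \alpha$. Let $g$ denote a unit Gaussian variable
$N(0, 1)$. Then for any $t \in \mathbb{R}$,

$$\left| \Pr\left[\sum_{i=1}^{n} c_i x_i \leq t\right] - \Pr[g \leq t] \right| \leq \alpha.$$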



Theorem A.3. (Chebyshev's Inequality) Let $X$ be a random variable with expected value $\mu$ and
variance $\sigma^2$. Then for any real number $t > 0$,

$$\Pr[|X - \mu| \geq t\sigma] \leq 1/t^2.$$
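
As a quick sanity check on Theorem A.2, the following Python sketch computes by brute-force enumeration the exact largest gap between the distribution function of $\sum_i c_i x_i$ and that of a unit Gaussian. The function names `gaussian_cdf` and `max_cdf_gap` and the choice of ten equal weights are illustrative assumptions, not part of the original text.

```python
import itertools
import math

def gaussian_cdf(t):
    """Pr[g <= t] for a unit Gaussian g."""
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))

def max_cdf_gap(c):
    """Exact sup over t of |Pr[sum_i c_i x_i <= t] - Pr[g <= t]| for uniform x in {-1,1}^n."""
    n = len(c)
    mass = {}
    # Enumerate all 2^n sign patterns to obtain the exact distribution of the sum.
    for signs in itertools.product((-1, 1), repeat=n):
        s = round(sum(ci * xi for ci, xi in zip(c, signs)), 12)
        mass[s] = mass.get(s, 0.0) + 2.0 ** (-n)
    gap, cdf = 0.0, 0.0
    # The distribution function is a step function, so the supremum is attained
    # at an atom, either just below it or at it.
    for s in sorted(mass):
        phi = gaussian_cdf(s)
        gap = max(gap, abs(cdf - phi))  # left limit at the atom
        cdf += mass[s]
        gap = max(gap, abs(cdf - phi))  # value at the atom
    return gap

n = 10
c = [1.0 / math.sqrt(n)] * n   # sum of squares is 1
alpha = max(abs(ci) for ci in c)
gap = max_cdf_gap(c)
print(gap, alpha)
```

For this example the gap is attained at $t = 0$ and stays below $\alpha = 1/\sqrt{10}$, as the theorem requires.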



B Proof of Lemma 3.3 



Recall that each $y^{(i)}$ is generated in the following manner:

$$y^{(i)} = \begin{cases} x^{(i)} & \text{with probability } 1 - \gamma, \\ \text{random bit} & \text{with probability } \gamma. \end{cases} \qquad (10)$$

Let us define a random vector $z \in \{0, 1\}^n$ based on $y$. For each $i$, if $y^{(i)}$ is generated as a
copy of $x^{(i)}$ in (10), then $z^{(i)} = 0$; if $y^{(i)}$ is generated as a random bit in (10), then $z^{(i)} = 1$. Let us
write $S = \sum_{i=1}^{n} w^{(i)} y^{(i)}$. Our proof is based on two claims.



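Claim B.1. For a $\tau$-regular vector $w$, $\Pr\left[\sum_{i=1}^{n} (w^{(i)})^2 z^{(i)} \geq \gamma/2\right] \geq 1 - 2e^{-\frac{\gamma^2}{2\tau^2}}$.

Claim B.2. For a $\tau$-regular vector $w$, given any $a' < b' \in \mathbb{R}$ and any fixing of $z^{(1)}, z^{(2)}, \ldots, z^{(n)}$,
if $\sum_{i=1}^{n} (w^{(i)})^2 z^{(i)} = \sigma^2 > 0$, then $\Pr[S \in [a', b']] \leq \frac{2|b' - a'|}{\sigma} + \frac{2\tau}{\sigma}$.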

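Given that the above two claims are correct, define the event $V$ to be $\left\{\sum_{i=1}^{n} (w^{(i)})^2 z^{(i)} \geq \frac{\gamma}{2}\right\}$ and use
$1_{[a,b]}(x) : \mathbb{R} \to \{0, 1\}$ to denote the indicator function of whether $x$ falls into the interval $[a, b]$. Then

$$\Pr[S \in [a, b]] = \mathbb{E}[1_{[a,b]}(S)] = \Pr[V]\, \mathbb{E}[1_{[a,b]}(S) \mid V] + \Pr[\neg V]\, \mathbb{E}[1_{[a,b]}(S) \mid \neg V].$$

By Claim B.1,

$$\Pr[\neg V]\, \mathbb{E}[1_{[a,b]}(S) \mid \neg V] \leq \Pr[\neg V] \leq 2e^{-\frac{\gamma^2}{2\tau^2}}.$$

By Claim B.2, applied with $\sigma^2 = \sum_{i=1}^{n} (w^{(i)})^2 z^{(i)} \geq \gamma/2$ whenever $V$ holds,

$$\Pr[V]\, \mathbb{E}[1_{[a,b]}(S) \mid V] \leq \frac{2|b - a|}{\sqrt{\gamma/2}} + \frac{2\tau}{\sqrt{\gamma/2}} \leq \frac{4|b - a|}{\sqrt{\gamma}} + \frac{4\tau}{\sqrt{\gamma}}.$$

Overall,

$$\Pr[S \in [a, b]] \leq \frac{4|b - a|}{\sqrt{\gamma}} + \frac{4\tau}{\sqrt{\gamma}} + 2e^{-\frac{\gamma^2}{2\tau^2}}.$$

It remains to verify Claim B.1 and Claim B.2.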

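To prove Claim B.1, we apply Hoeffding's inequality (see Theorem A.1).
Notice that $(w^{(i)})^2 z^{(i)} \in [0, (w^{(i)})^2]$; applying Hoeffding's inequality, we get

$$\Pr\left[\left|\sum_{i=1}^{n} (w^{(i)})^2 z^{(i)} - \mathbb{E}\left[\sum_{i=1}^{n} (w^{(i)})^2 z^{(i)}\right]\right| \geq nt\right] \leq 2\exp\left(-\frac{2n^2t^2}{\sum_{i=1}^{n} ((w^{(i)})^2)^2}\right).$$

We know $\mathbb{E}\left[\sum_{i=1}^{n} (w^{(i)})^2 z^{(i)}\right] = \gamma$ and $\sum_{i=1}^{n} ((w^{(i)})^2)^2 \leq \max_i \{(w^{(i)})^2\} \sum_{i=1}^{n} (w^{(i)})^2 \leq \tau^2$. If we
take $nt = \gamma/2$, we have

$$\Pr\left[\left|\sum_{i=1}^{n} (w^{(i)})^2 z^{(i)} - \gamma\right| \geq \frac{\gamma}{2}\right] \leq 2e^{-\frac{\gamma^2}{2\tau^2}}.$$

Therefore, with probability at least $1 - 2e^{-\frac{\gamma^2}{2\tau^2}}$, $\sum_{i=1}^{n} (w^{(i)})^2 z^{(i)} \geq \frac{\gamma}{2}$.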
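
The bound of Claim B.1 can be compared with an exact computation on a small instance. The Python sketch below enumerates all values of $z$ for a unit vector with equal coordinates; the names `prob_mass_at_least` and `claim_b1_bound` and the parameters $n = 16$, $\gamma = 0.5$ are illustrative assumptions rather than values taken from the paper.

```python
import itertools
import math

def prob_mass_at_least(w, gamma, threshold):
    """Exact Pr[sum_i w_i^2 z_i >= threshold] for independent z_i ~ Bernoulli(gamma)."""
    total = 0.0
    for z in itertools.product((0, 1), repeat=len(w)):
        p = 1.0
        for zi in z:
            p *= gamma if zi else 1.0 - gamma
        if sum(wi * wi for wi, zi in zip(w, z) if zi) >= threshold - 1e-12:
            total += p
    return total

def claim_b1_bound(gamma, tau):
    """Lower bound 1 - 2 exp(-gamma^2 / (2 tau^2)) from Claim B.1."""
    return 1.0 - 2.0 * math.exp(-gamma ** 2 / (2.0 * tau ** 2))

n = 16
w = [1.0 / math.sqrt(n)] * n          # unit vector, so it is tau-regular with tau = 1/4
tau = max(abs(wi) for wi in w)
gamma = 0.5

exact = prob_mass_at_least(w, gamma, gamma / 2)
bound = claim_b1_bound(gamma, tau)
print(exact, bound)
```

On this instance the exact probability is well above the Hoeffding-based bound, which is expected since the bound is not tight for such small $n$.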

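To prove Claim B.2, we use the Berry-Esseen Theorem (see Theorem A.2). Let us split $S$ into
two parts: $S' = \sum_{z^{(i)}=1} w^{(i)} y^{(i)}$ and $S'' = \sum_{z^{(i)}=0} w^{(i)} y^{(i)}$. Since $S = S' + S''$ and $S'$ is independent of $S''$,
it suffices to show that $\Pr[S' \in [a', b']] \leq \frac{2|b' - a'|}{\sigma} + \frac{2\tau}{\sigma}$ for any $a', b' \in \mathbb{R}$. Define $y'^{(i)} = 2y^{(i)} - 1$ and
note that $y'^{(i)}$ is a $\{-1, 1\}$ variable. By rewriting $S'$ using this definition, we have

$$S' = \frac{1}{2} \sum_{z^{(i)}=1} w^{(i)} y'^{(i)} + \frac{1}{2} \sum_{z^{(i)}=1} w^{(i)}.$$

Then

$$\Pr[S' \in [a', b']] = \Pr\left[\sum_{z^{(i)}=1} w^{(i)} y'^{(i)} \in [a'', b'']\right] \qquad (11)$$

where $a'' = 2a' - \sum_{z^{(i)}=1} w^{(i)}$ and $b'' = 2b' - \sum_{z^{(i)}=1} w^{(i)}$. With $\sigma = \sqrt{\sum_{z^{(i)}=1} (w^{(i)})^2}$, we can further rewrite the above term
as

$$\Pr\left[\frac{\sum_{z^{(i)}=1} w^{(i)} y'^{(i)}}{\sigma} \in \left[\frac{a''}{\sigma}, \frac{b''}{\sigma}\right]\right].$$

We can now apply the Berry-Esseen theorem. Notice that for all $i$ such that $z^{(i)} = 1$, $y'^{(i)}$ is
distributed as an independent unbiased random $\{-1, 1\}$ variable. Also $\max_{z^{(i)}=1} |w^{(i)}| \leq \tau$, so every normalized coefficient has absolute value at most $\frac{\tau}{\sigma}$.

By the Berry-Esseen theorem, we know that expression (11) is bounded by

$$\Pr\left[N(0,1) \leq \frac{b''}{\sigma}\right] - \Pr\left[N(0,1) \leq \frac{a''}{\sigma}\right] + \frac{2\tau}{\sigma}.$$

Using the fact that a unit Gaussian variable falls in any interval of length $\lambda$ with probability at
most $\lambda$ and noticing that $b'' - a'' = 2(b' - a')$, we can bound the above quantity by

$$\frac{2|b' - a'|}{\sqrt{\sum_{z^{(i)}=1} (w^{(i)})^2}} + \frac{2\tau}{\sqrt{\sum_{z^{(i)}=1} (w^{(i)})^2}} = \frac{2|b' - a'|}{\sigma} + \frac{2\tau}{\sigma}.$$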
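
The anti-concentration bound of Claim B.2 can likewise be checked exactly on a toy instance. In the Python sketch below, the function name `interval_prob` and the specific choices of $w$, $x$, $z$ and the interval are hypothetical; coordinates with $z^{(i)} = 0$ copy $x^{(i)}$ and the remaining ones are uniform bits, following (10).

```python
import itertools
import math

def interval_prob(w, x, z, lo, hi):
    """Exact Pr[S in [lo, hi]] where y_i = x_i if z_i == 0 and y_i is a uniform bit if z_i == 1."""
    noisy = [i for i, zi in enumerate(z) if zi]
    base = sum(w[i] * x[i] for i, zi in enumerate(z) if not zi)
    hits = 0
    for bits in itertools.product((0, 1), repeat=len(noisy)):
        s = base + sum(w[i] * b for i, b in zip(noisy, bits))
        if lo - 1e-12 <= s <= hi + 1e-12:
            hits += 1
    return hits / 2 ** len(noisy)

n = 12
w = [1.0 / math.sqrt(n)] * n            # unit vector, tau = 1/sqrt(12)
tau = max(abs(wi) for wi in w)
x = [0, 1] * (n // 2)                   # an arbitrary fixed input
z = [1] * 8 + [0] * 4                   # first 8 coordinates are re-randomized

sigma = math.sqrt(sum(w[i] ** 2 for i in range(n) if z[i]))
lo = hi = 6 * w[0]                      # single point: the most likely value of S
prob = interval_prob(w, x, z, lo, hi)
bound = 2 * abs(hi - lo) / sigma + 2 * tau / sigma
print(prob, bound)
```

Even for a degenerate interval of length zero, the additive term $2\tau/\sigma$ dominates the largest point mass of $S$, which is the role it plays in the claim.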



C Proof of Invariance Principle (Th. 3.10) 

We restate our version of the invariance principle here for convenience. 

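Theorem 3.10 restated (Invariance Principle) Let $\mathcal{A} = \{A^{(1)}, \ldots, A^{(R)}\}$, $\mathcal{B} = \{B^{(1)}, \ldots, B^{(R)}\}$
be families of ensembles of random variables with $A^{(i)} = \{a_1^{(i)}, \ldots, a_{k_i}^{(i)}\}$ and $B^{(i)} = \{b_1^{(i)}, \ldots, b_{k_i}^{(i)}\}$,
satisfying the following properties:

• For each $i \in [R]$, the random variables in ensembles $(A^{(i)}, B^{(i)})$ have matching moments up
to degree 3. Further all the random variables in $\mathcal{A}$ and $\mathcal{B}$ are bounded by 1.

• The ensembles $A^{(i)}$ are all independent of each other, similarly the ensembles $B^{(i)}$ are
independent of each other.

Given a set of vectors $l = \{l^{(1)} \in \mathbb{R}^{k_1}, \ldots, l^{(R)} \in \mathbb{R}^{k_R}\}$, define the linear function $l : \mathbb{R}^{k_1} \times \cdots \times \mathbb{R}^{k_R} \to \mathbb{R}$
as

$$l(x) = \sum_{i \in [R]} \langle l^{(i)}, x^{(i)} \rangle.$$

Then for a $K$-bounded function $\Psi : \mathbb{R} \to \mathbb{R}$ we have

$$\left| \mathbb{E}_{\mathcal{A}}\left[\Psi(l(\mathcal{A}) - \theta)\right] - \mathbb{E}_{\mathcal{B}}\left[\Psi(l(\mathcal{B}) - \theta)\right] \right| \leq K \sum_{i \in [R]} \|l^{(i)}\|_1^4 \qquad (12)$$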



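for all $\theta > 0$. Further, define the spread function $c(\alpha)$ corresponding to the ensembles $\mathcal{A}$, $\mathcal{B}$ and the
linear function $l$ as follows.

(Spread Function) For $1/2 > \alpha > 0$, let

$$c(\alpha) = \max\left( \sup_{\theta} \Pr_{\mathcal{A}}\left[l(\mathcal{A}) \in [\theta - \alpha, \theta + \alpha]\right],\ \sup_{\theta} \Pr_{\mathcal{B}}\left[l(\mathcal{B}) \in [\theta - \alpha, \theta + \alpha]\right] \right);$$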



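then for all $\theta$,

$$\left| \mathbb{E}_{\mathcal{A}}\left[\mathrm{pos}(l(\mathcal{A}) - \theta)\right] - \mathbb{E}_{\mathcal{B}}\left[\mathrm{pos}(l(\mathcal{B}) - \theta)\right] \right| \leq O\left(\frac{1}{\alpha^4}\right) \sum_{i \in [R]} \|l^{(i)}\|_1^4 + 2c(\alpha). \qquad (13)$$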



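Proof. Let us prove equation (12) first. Let $X_i = \{B^{(1)}, \ldots, B^{(i-1)}, B^{(i)}, A^{(i+1)}, \ldots, A^{(R)}\}$, so that $X_0 = \mathcal{A}$ and $X_R = \mathcal{B}$.
We know that

$$\mathbb{E}_{\mathcal{A}}[\Psi(l(\mathcal{A}) - \theta)] - \mathbb{E}_{\mathcal{B}}[\Psi(l(\mathcal{B}) - \theta)] = \mathbb{E}_{X_0}[\Psi(l(X_0) - \theta)] - \mathbb{E}_{X_R}[\Psi(l(X_R) - \theta)]$$

$$= \sum_{i=1}^{R} \left( \mathbb{E}_{X_{i-1}}[\Psi(l(X_{i-1}) - \theta)] - \mathbb{E}_{X_i}[\Psi(l(X_i) - \theta)] \right).$$

Therefore, it suffices to prove

$$\left| \mathbb{E}_{X_{i-1}}[\Psi(l(X_{i-1}) - \theta)] - \mathbb{E}_{X_i}[\Psi(l(X_i) - \theta)] \right| \leq K \|l^{(i)}\|_1^4. \qquad (14)$$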



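Let $Y_i = \{B^{(1)}, \ldots, B^{(i-1)}, A^{(i+1)}, \ldots, A^{(R)}\}$; then we have $X_i = \{Y_i, B^{(i)}\}$ and
$X_{i-1} = \{Y_i, A^{(i)}\}$. Then

$$\mathbb{E}_{X_{i-1}}[\Psi(l(X_{i-1}) - \theta)] - \mathbb{E}_{X_i}[\Psi(l(X_i) - \theta)] = \mathbb{E}_{Y_i}\left[ \mathbb{E}_{A^{(i)}}[\Psi(l(X_{i-1}) - \theta)] - \mathbb{E}_{B^{(i)}}[\Psi(l(X_i) - \theta)] \right]. \qquad (15)$$

Notice that

$$l(X_{i-1}) = \langle l^{(i)}, A^{(i)} \rangle + \sum_{1 \leq j \leq i-1} \langle l^{(j)}, B^{(j)} \rangle + \sum_{i+1 \leq j \leq R} \langle l^{(j)}, A^{(j)} \rangle$$

and

$$l(X_i) = \langle l^{(i)}, B^{(i)} \rangle + \sum_{1 \leq j \leq i-1} \langle l^{(j)}, B^{(j)} \rangle + \sum_{i+1 \leq j \leq R} \langle l^{(j)}, A^{(j)} \rangle.$$

Take $\theta' = \sum_{1 \leq j \leq i-1} \langle l^{(j)}, B^{(j)} \rangle + \sum_{i+1 \leq j \leq R} \langle l^{(j)}, A^{(j)} \rangle - \theta$. We can further rewrite equation
(15) as

$$\mathbb{E}_{Y_i}\left[ \mathbb{E}_{A^{(i)}}\left[\Psi(\langle l^{(i)}, A^{(i)} \rangle + \theta')\right] - \mathbb{E}_{B^{(i)}}\left[\Psi(\langle l^{(i)}, B^{(i)} \rangle + \theta')\right] \right]. \qquad (16)$$

Using the Taylor expansion of $\Psi$, we have that the inner expectation of equation (16) is equal
to

$$\mathbb{E}_{A^{(i)}}\left[ \Psi(\theta') + \Psi'(\theta') \langle l^{(i)}, A^{(i)} \rangle + \frac{\Psi''(\theta')}{2} \langle l^{(i)}, A^{(i)} \rangle^2 + \frac{\Psi'''(\theta')}{6} \langle l^{(i)}, A^{(i)} \rangle^3 + \frac{\Psi''''(\delta_1)}{24} \langle l^{(i)}, A^{(i)} \rangle^4 \right]$$

$$- \mathbb{E}_{B^{(i)}}\left[ \Psi(\theta') + \Psi'(\theta') \langle l^{(i)}, B^{(i)} \rangle + \frac{\Psi''(\theta')}{2} \langle l^{(i)}, B^{(i)} \rangle^2 + \frac{\Psi'''(\theta')}{6} \langle l^{(i)}, B^{(i)} \rangle^3 + \frac{\Psi''''(\delta_2)}{24} \langle l^{(i)}, B^{(i)} \rangle^4 \right] \qquad (17)$$

for some $\delta_1, \delta_2 \in \mathbb{R}$.

Using the fact that $A^{(i)}$ and $B^{(i)}$ have matching moments up to degree 3, we can upper bound
the absolute value of equation (17) by

$$\left| \mathbb{E}_{A^{(i)}}\left[\frac{\Psi''''(\delta_1)}{24} \langle l^{(i)}, A^{(i)} \rangle^4\right] - \mathbb{E}_{B^{(i)}}\left[\frac{\Psi''''(\delta_2)}{24} \langle l^{(i)}, B^{(i)} \rangle^4\right] \right| \leq \frac{K}{12} \|l^{(i)}\|_1^4.$$

In the last inequality, we use the fact that $\Psi$ is $K$-bounded and $|\langle l^{(i)}, A^{(i)} \rangle|, |\langle l^{(i)}, B^{(i)} \rangle| \leq \|l^{(i)}\|_1$ since all random variables in $\mathcal{A}$, $\mathcal{B}$ are bounded by 1.

Overall, we bound the inner expectation of equation (16) by $\frac{K}{12} \|l^{(i)}\|_1^4$. This bounds equation
(16), and therefore the left-hand side of equation (14), by $K \|l^{(i)}\|_1^4$, establishing equation (12).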

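To prove equation (13), we need to use the following lemma.

Lemma C.1. ([39], Lemma 3.21) There exists an absolute constant $C$ such that for all $0 < \lambda < \frac{1}{2}$, there
exists a $\frac{C}{\lambda^4}$-bounded function $\Phi_\lambda : \mathbb{R} \to [0, 1]$ which approximates the $\mathrm{pos}(x)$ function in the following
sense: $\Phi_\lambda(t) = 1$ for all $t > \lambda$; $\Phi_\lambda(t) = 0$ for $t < -\lambda$.

By the above lemma, we can find a $\frac{C}{\alpha^4}$-bounded function $\Phi_\alpha$ such that $\Phi_\alpha(l(\mathcal{A}) - \theta)$ is equal to
$\mathrm{pos}(l(\mathcal{A}) - \theta)$ except when $l(\mathcal{A}) \in [\theta - \alpha, \theta + \alpha]$, and $\Phi_\alpha(l(\mathcal{B}) - \theta)$ is equal to $\mathrm{pos}(l(\mathcal{B}) - \theta)$ except
when $l(\mathcal{B}) \in [\theta - \alpha, \theta + \alpha]$. Also for any $x \in \mathbb{R}$, $|\mathrm{pos}(x) - \Phi_\alpha(x)| \leq 1$ as $\mathrm{pos}(x)$ and $\Phi_\alpha(x)$ are both
in $[0, 1]$.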

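Overall, we have

$$\left| \mathbb{E}_{\mathcal{A}}[\mathrm{pos}(l(\mathcal{A}) - \theta)] - \mathbb{E}_{\mathcal{B}}[\mathrm{pos}(l(\mathcal{B}) - \theta)] \right| \leq \left| \mathbb{E}_{\mathcal{A}}[\Phi_\alpha(l(\mathcal{A}) - \theta)] - \mathbb{E}_{\mathcal{B}}[\Phi_\alpha(l(\mathcal{B}) - \theta)] \right|$$

$$+ \left| \mathbb{E}_{\mathcal{A}}[\mathrm{pos}(l(\mathcal{A}) - \theta) - \Phi_\alpha(l(\mathcal{A}) - \theta)] \right| + \left| \mathbb{E}_{\mathcal{B}}[\mathrm{pos}(l(\mathcal{B}) - \theta) - \Phi_\alpha(l(\mathcal{B}) - \theta)] \right| \leq O\left(\frac{1}{\alpha^4}\right) \sum_{i \in [R]} \|l^{(i)}\|_1^4 + 2c(\alpha),$$

where the first term is bounded using equation (12) and each of the last two terms is at most $c(\alpha)$ by the definition of the spread function. □

D Hardness of Smooth k-Label Cover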



First we state the bipartite smooth Label Cover given by Khot [31]. Our reduction is similar to 
the one in [19] but in addition requires proving the smoothness property. 

Definition D.1. A Label Cover problem $\mathcal{L}(G(V, W, E), M, N, \{\pi^{v,w} \mid (w, v) \in E\})$ consists of a
bipartite graph $G(V, W, E)$ with bipartition $V$ and $W$. $M, N$ are two positive integers such that
$M > N$. There are projection functions $\pi^{v,w} : [M] \to [N]$ associated with each edge $(w, v) \in E$
where $v \in V, w \in W$. All vertices in $W$ have the same degree (i.e., $W$-side regular). For any
labeling $\Lambda : V \to [M]$ and $\Lambda : W \to [N]$, an edge is said to be satisfied if $\pi^{v,w}(\Lambda(v)) = \Lambda(w)$. We
define $\mathrm{Opt}(\mathcal{L})$ to be the maximum fraction of edges satisfied by any labeling.

Theorem D.2. There is an absolute constant $\gamma > 0$ such that for all integer parameters $u$ and $J$, it
is NP-hard to distinguish the following two cases: A Label Cover problem $\mathcal{L}(G(V, W, E), M, N, \{\pi^{v,w} \mid (w, v) \in E\})$ with $M = 7^{(J+1)u}$ and $N = 2^u 7^{Ju}$ having

• $\mathrm{Opt}(\mathcal{L}) = 1$ or

• $\mathrm{Opt}(\mathcal{L}) \leq 2^{-2\gamma u}$.

In addition, the Label Cover has the following properties:

• for each $\pi^{v,w}$ and any $i \in [N]$, we have $|(\pi^{v,w})^{-1}(i)| \leq 4^u$;

• for a fixed vertex $v$ and a randomly picked neighbor $w$ of $v$,

$$\forall i \neq j \in [M],\ \Pr[\pi^{v,w}(i) = \pi^{v,w}(j)] \leq 1/J.$$

Below we prove Theorem 5.1. 

Proof. Given an instance of bipartite Label Cover $\mathcal{L}(G(V, W, E), M, N, \{\pi^{v,w} \mid (w, v) \in E\})$, we can
convert it to a smooth $k$-Label Cover instance $\mathcal{L}'$ as follows. The vertex set of $\mathcal{L}'$ is $V$ and we
generate the hyperedge set $E'$ and projections associated with the hyperedges in the following way:

1. pick a vertex $w \in W$;

2. pick a $k$-tuple of $w$'s neighbors $v_1, \ldots, v_k$ and add a hyperedge $e = (v_1, \ldots, v_k)$ to $E'$ with
projections $\pi^{v_i,e} = \pi^{v_i,w}$ for each $i \in [k]$.

Completeness: If $\mathrm{Opt}(\mathcal{L}) = 1$, then there exists a labeling $\Lambda$ such that for every edge $(w, v) \in E$,
$\pi^{v,w}(\Lambda(v)) = \Lambda(w)$. We can simply take the restriction of the labeling $\Lambda$ to $V$ for the smooth $k$-Label
Cover instance $\mathcal{L}'$. For any hyperedge $e = (v_1, v_2, \ldots, v_k)$ generated by $w \in W$, we know
$\pi^{v_i,e}(\Lambda(v_i)) = \Lambda(w) = \pi^{v_j,e}(\Lambda(v_j))$ for any $i, j \in [k]$.

Soundness: If $\mathrm{Opt}(\mathcal{L}) \leq 2^{-2\gamma u}$, then we can weakly satisfy at most a $2k^2 2^{-\gamma u}$-fraction of the
hyperedges in $\mathcal{L}'$. This can be proved via a contrapositive argument. Suppose there is a labeling
strategy $\Lambda$ (defined on $V$) for the smooth $k$-Label Cover that weakly satisfies an $\alpha \geq 2k^2 2^{-\gamma u}$
fraction of the hyperedges. Extend the labeling to $W$ as follows: for each vertex $w \in W$ and a
neighbor $v \in V$, let $\pi^{v,w}(\Lambda(v))$ be the label recommended by $v$ to $w$. Simply assign to every vertex
$w \in W$ the label most recommended by its neighbours.

By the fact that A weakly satisfies a- fraction of hyperedges in C' , we know that if we pick a 
vertex w and randomly pick two of its neighbors vi , V2 then 

Pr K--(A(^;i)) = ^^-'"(A(^;2))] ^ ^ ^- 

By an averaging argument, at least ^-fraction of the vertices w G W , will have the following 
property: among all the possible pairs of w's neighbors, at least ^-fraction of pairs recommend the 
same label for w. Let us call such a if to be a nice. It is easy to see that for every nice w, the 
most recommended label is actually recommended by at least ^ fraction of its neighbours. Hence, 
the extended labelling satisfies at least a/k'^ fraction of edges incident at each nice w E W. Using 
Vl^-side regularity, we conclude that the extended labelling satisfies ^ = 4- 2~^'^*'-fraction the edges 
of £ - a contradiction. 



Smoothness of $\mathcal{L}'$: For any given vertex $v$ in $\mathcal{L}'$, we want to show that if we randomly pick a
hyperedge $e'$ containing $v$, then for the projection $\pi^{v,e'}$ as defined in $\mathcal{L}'$,

$$\forall i \neq j \in [M],\ \Pr[\pi^{v,e'}(i) = \pi^{v,e'}(j)] \leq \frac{1}{J}.$$

To see this, notice that all vertices in $W$ have the same degree; picking a projection $\pi^{v,e'}$ using
the above procedure is the same as randomly picking a neighbor $w$ of $v$ and using the projection
$\pi^{v,w}$ defined in $\mathcal{L}$. Therefore,

$$\forall i \neq j \in [M],\ \Pr[\pi^{v,e'}(i) = \pi^{v,e'}(j)] = \Pr[\pi^{v,w}(i) = \pi^{v,w}(j)] \leq \frac{1}{J}.$$

□ 
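
The conversion from bipartite Label Cover to $k$-Label Cover used in this proof, together with the completeness check, can be sketched in Python as follows. This is only an illustration under stated assumptions: the function names `to_k_label_cover` and `strongly_satisfied`, the dictionary-based encoding, the toy instance, and the choice to enumerate $k$-tuples of neighbors with repetition are all hypothetical and not taken from the paper.

```python
import itertools

def to_k_label_cover(neighbors, proj, k):
    """Build the hyperedges of the k-Label Cover instance.

    neighbors: dict mapping each w in W to the list of its neighbors in V.
    proj: dict mapping (v, w) to a dict from labels in [M] to labels in [N].
    Returns a list of (vertex tuple, projection tuple) pairs, one per (w, k-tuple).
    Assumption: k-tuples are taken with repetition.
    """
    hyperedges = []
    for w, nbrs in neighbors.items():
        for tup in itertools.product(nbrs, repeat=k):
            hyperedges.append((tup, tuple(proj[(v, w)] for v in tup)))
    return hyperedges

def strongly_satisfied(edge, labeling):
    """True if all vertices of the hyperedge project to the same label."""
    vertices, projections = edge
    images = {p[labeling[v]] for v, p in zip(vertices, projections)}
    return len(images) == 1

# Toy W-side regular instance (hypothetical): M = 4, N = 2, projection i -> i mod 2.
neighbors = {"w1": ["v1", "v2"], "w2": ["v2", "v3"]}
mod2 = {0: 0, 1: 1, 2: 0, 3: 1}
proj = {(v, w): mod2 for w, nbrs in neighbors.items() for v in nbrs}

edges = to_k_label_cover(neighbors, proj, k=3)
good = {"v1": 0, "v2": 2, "v3": 0}   # restriction to V of a labeling satisfying every edge
bad = {"v1": 1, "v2": 2, "v3": 0}    # v1 now disagrees with v2 under the projection
print(len(edges), all(strongly_satisfied(e, good) for e in edges))
```

With the labeling `good`, every vertex projects to label 0, so every hyperedge is satisfied, mirroring the completeness argument; with `bad`, any hyperedge containing both `v1` and `v2` fails the check.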
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